Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2020/06/24 22:02:30 UTC

[DISCUSS] Ongoing LZ4 problems with Parquet files

hi folks,

(cross-posting to dev@arrow and dev@parquet since there are
stakeholders in both places)

It seems there are still problems, at least with the C++ implementation
of LZ4 compression in Parquet files:

https://issues.apache.org/jira/browse/PARQUET-1241
https://issues.apache.org/jira/browse/PARQUET-1878

If these problems cannot be resolved, I am going to recommend that we
disable use of LZ4 in the Parquet C++ library until these things can
be properly tested and validated across different implementations.
Thoughts? We're within weeks of the next Apache Arrow release so if
we're going to disable LZ4-for-Parquet it needs to happen soon.

Thanks
Wes

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Patrick Pai <pa...@gmail.com>.
I'll volunteer to disable writing/reading LZ4. I'll submit a patch in the next few days.

On 2020/07/12 22:11:33, Wes McKinney <we...@gmail.com> wrote: 
> Since there hasn't been other movement on this, we need to disable
> writing LZ4-compressed files until this can be investigated more
> thoroughly. If someone wants to submit a patch that would be helpful
> otherwise I can take a look in the next couple days
> 
> On Thu, Jul 2, 2020 at 12:50 PM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Well, it depends how important speed is, but LZ4 has extremely fast
> > decompression, even compared to Snappy:
> > https://github.com/lz4/lz4#benchmarks
> >
> > Regards
> >
> > Antoine.
> >
> >
> > > On 2 Jul 2020 at 19:47, Christian Hudon wrote:
> > > At least for us, the advantages of Parquet are speed and interoperability
> > > in the context of longer-term data storage, so I would tend to say
> > > "reasonably conservative".
> > >
> > > > On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou <so...@pitrou.net>
> > > > wrote:
> > >
> > >>
> > >> I don't have a sense of how conservative Parquet users generally are.
> > >> Is it worth adding a LZ4_FRAMED compression option in the Parquet
> > >> format, or would people just not use it?
> > >>
> > >> Regards
> > >>
> > >> Antoine.
> > >>
> > >>
> > >> On Tue, 30 Jun 2020 14:33:17 +0200
> > >> "Uwe L. Korn" <uw...@xhochy.com> wrote:
> > >>> I'm also in favor of disabling support for now. Having to deal with
> > >> broken files or the detection of various incompatible implementations in
> > >> the long-term will harm more than not supporting LZ4 for a while. Snappy is
> > >> generally more used than LZ4 in this category as it has been available
> > >> since the inception of Parquet and thus should be considered as a viable
> > >> alternative.
> > >>>
> > >>> Cheers
> > >>> Uwe
> > >>>
> > >>> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> > >>>> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org>
> > >> wrote:
> > >>>>>
> > >>>>>
> > >>>>> On 25 Jun 2020 at 00:02, Wes McKinney wrote:
> > >>>>>> hi folks,
> > >>>>>>
> > >>>>>> (cross-posting to dev@arrow and dev@parquet since there are
> > >>>>>> stakeholders in both places)
> > >>>>>>
> > >>>>>> It seems there are still problems at least with the C++
> > >> implementation
> > >>>>>> of LZ4 compression in Parquet files
> > >>>>>>
> > >>>>>> https://issues.apache.org/jira/browse/PARQUET-1241
> > >>>>>> https://issues.apache.org/jira/browse/PARQUET-1878
> > >>>>>
> > >>>>> I don't have any particular opinion on how to solve the LZ4 issue,
> > >> but
> > >>>>> I'd like to mention that LZ4 and ZStandard are the two most efficient
> > >>>>> compression algorithms available, and they span different parts of
> > >> the
> > >>>>> speed/compression spectrum, so it would be a pity to disable one of
> > >> them.
> > >>>>
> > >>>> It's true, however I think it's worse to write LZ4-compressed files
> > >>>> that cannot be read by other Parquet implementations (if that's what's
> > >>>> happening as I understand it?). If we are indeed shipping something
> > >>>> broken then we either should fix it or disable it until it can be
> > >>>> fixed.
> > >>>>
> > >>>>> Regards
> > >>>>>
> > >>>>> Antoine.
> > >>>>
> > >>>
> > >>
> > >>
> > >>
> > >>
> > >
> 

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Antoine Pitrou <an...@python.org>.
Agreed, but even then, if some Parquet files are generated inside of a
well-defined system which only needs to be interoperable with itself,
it's not necessarily harmful to allow LZ4 compression when writing new files.

Regards

Antoine.


On 13 Jul 2020 at 17:07, Wes McKinney wrote:
> I didn’t say to disable _reading_ them, only writing them.
> 
> On Mon, Jul 13, 2020 at 4:15 AM Antoine Pitrou <an...@python.org> wrote:
> 
>>
>> I'm not sure that's a good idea.  There are probably Parquet files that
>> are only ever used with the Arrow implementation (Arrow C++, Arrow
>> Python, Arrow R...).
>>
>> I admit I'm also not terribly bothered about this, since the Parquet
>> community itself doesn't seem to care much about the issue (it has been
>> known for a long time and they could have solved it long ago).
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 13 Jul 2020 at 00:11, Wes McKinney wrote:
>>> Since there hasn't been other movement on this, we need to disable
>>> writing LZ4-compressed files until this can be investigated more
>>> thoroughly. If someone wants to submit a patch that would be helpful
>>> otherwise I can take a look in the next couple days
>>>
>>> On Thu, Jul 2, 2020 at 12:50 PM Antoine Pitrou <an...@python.org>
>> wrote:
>>>>
>>>>
>>>> Well, it depends how important speed is, but LZ4 has extremely fast
>>>> decompression, even compared to Snappy:
>>>> https://github.com/lz4/lz4#benchmarks
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>> On 2 Jul 2020 at 19:47, Christian Hudon wrote:
>>>>> At least for us, the advantages of Parquet are speed and
>> interoperability
>>>>> in the context of longer-term data storage, so I would tend to say
>>>>> "reasonably conservative".
>>>>>
>>>>> On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou <so...@pitrou.net>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> I don't have a sense of how conservative Parquet users generally are.
>>>>>> Is it worth adding a LZ4_FRAMED compression option in the Parquet
>>>>>> format, or would people just not use it?
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Antoine.
>>>>>>
>>>>>>
>>>>>> On Tue, 30 Jun 2020 14:33:17 +0200
>>>>>> "Uwe L. Korn" <uw...@xhochy.com> wrote:
>>>>>>> I'm also in favor of disabling support for now. Having to deal with
>>>>>> broken files or the detection of various incompatible implementations
>> in
>>>>>> the long-term will harm more than not supporting LZ4 for a while.
>> Snappy is
>>>>>> generally more used than LZ4 in this category as it has been available
>>>>>> since the inception of Parquet and thus should be considered as a
>> viable
>>>>>> alternative.
>>>>>>>
>>>>>>> Cheers
>>>>>>> Uwe
>>>>>>>
>>>>>>> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
>>>>>>>> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org>
>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 25 Jun 2020 at 00:02, Wes McKinney wrote:
>>>>>>>>>> hi folks,
>>>>>>>>>>
>>>>>>>>>> (cross-posting to dev@arrow and dev@parquet since there are
>>>>>>>>>> stakeholders in both places)
>>>>>>>>>>
>>>>>>>>>> It seems there are still problems at least with the C++
>>>>>> implementation
>>>>>>>>>> of LZ4 compression in Parquet files
>>>>>>>>>>
>>>>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1241
>>>>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1878
>>>>>>>>>
>>>>>>>>> I don't have any particular opinion on how to solve the LZ4 issue,
>>>>>> but
>>>>>>>>> I'd like to mention that LZ4 and ZStandard are the two most
>> efficient
>>>>>>>>> compression algorithms available, and they span different parts of
>>>>>> the
>>>>>>>>> speed/compression spectrum, so it would be a pity to disable one of
>>>>>> them.
>>>>>>>>
>>>>>>>> It's true, however I think it's worse to write LZ4-compressed files
>>>>>>>> that cannot be read by other Parquet implementations (if that's
>> what's
>>>>>>>> happening as I understand it?). If we are indeed shipping something
>>>>>>>> broken then we either should fix it or disable it until it can be
>>>>>>>> fixed.
>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Antoine.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>
> 

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Wes McKinney <we...@gmail.com>.
I didn’t say to disable _reading_ them, only writing them.
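
To make the read/write distinction concrete, here is a minimal sketch of a codec check that rejects LZ4 only on the write path. The names `check_codec`, `WRITE_DISABLED`, and `KNOWN_CODECS` are hypothetical and are not taken from the Arrow or Parquet C++ code; an actual patch may look quite different.

```python
# Hypothetical codec gate: all names here are illustrative, not Arrow APIs.
WRITE_DISABLED = {"LZ4"}  # codecs we refuse to write for now
KNOWN_CODECS = {"UNCOMPRESSED", "SNAPPY", "GZIP", "ZSTD", "LZ4"}


def check_codec(codec: str, mode: str) -> str:
    """Return the normalized codec name, or raise if it may not be used.

    mode is "read" or "write"; only the write path is restricted.
    """
    name = codec.upper()
    if name not in KNOWN_CODECS:
        raise ValueError(f"unknown codec: {codec!r}")
    if mode not in ("read", "write"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "write" and name in WRITE_DISABLED:
        raise NotImplementedError(
            f"writing {name}-compressed Parquet files is disabled"
        )
    return name
```

With this shape, existing LZ4 files stay readable while new ones cannot be produced until the codec is validated.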

On Mon, Jul 13, 2020 at 4:15 AM Antoine Pitrou <an...@python.org> wrote:

>
> I'm not sure that's a good idea.  There are probably Parquet files that
> are only ever used with the Arrow implementation (Arrow C++, Arrow
> Python, Arrow R...).
>
> I admit I'm also not terribly bothered about this, since the Parquet
> community itself doesn't seem to care much about the issue (it has been
> known for a long time and they could have solved it long ago).
>
> Regards
>
> Antoine.
>
>
> On 13 Jul 2020 at 00:11, Wes McKinney wrote:
> > Since there hasn't been other movement on this, we need to disable
> > writing LZ4-compressed files until this can be investigated more
> > thoroughly. If someone wants to submit a patch that would be helpful
> > otherwise I can take a look in the next couple days
> >
> > On Thu, Jul 2, 2020 at 12:50 PM Antoine Pitrou <an...@python.org>
> wrote:
> >>
> >>
> >> Well, it depends how important speed is, but LZ4 has extremely fast
> >> decompression, even compared to Snappy:
> >> https://github.com/lz4/lz4#benchmarks
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> On 2 Jul 2020 at 19:47, Christian Hudon wrote:
> >>> At least for us, the advantages of Parquet are speed and
> interoperability
> >>> in the context of longer-term data storage, so I would tend to say
> >>> "reasonably conservative".
> >>>
> >>> On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou <so...@pitrou.net>
> >>> wrote:
> >>>
> >>>>
> >>>> I don't have a sense of how conservative Parquet users generally are.
> >>>> Is it worth adding a LZ4_FRAMED compression option in the Parquet
> >>>> format, or would people just not use it?
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>>
> >>>> On Tue, 30 Jun 2020 14:33:17 +0200
> >>>> "Uwe L. Korn" <uw...@xhochy.com> wrote:
> >>>>> I'm also in favor of disabling support for now. Having to deal with
> >>>> broken files or the detection of various incompatible implementations
> in
> >>>> the long-term will harm more than not supporting LZ4 for a while.
> Snappy is
> >>>> generally more used than LZ4 in this category as it has been available
> >>>> since the inception of Parquet and thus should be considered as a
> viable
> >>>> alternative.
> >>>>>
> >>>>> Cheers
> >>>>> Uwe
> >>>>>
> >>>>> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> >>>>>> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org>
> >>>> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On 25 Jun 2020 at 00:02, Wes McKinney wrote:
> >>>>>>>> hi folks,
> >>>>>>>>
> >>>>>>>> (cross-posting to dev@arrow and dev@parquet since there are
> >>>>>>>> stakeholders in both places)
> >>>>>>>>
> >>>>>>>> It seems there are still problems at least with the C++
> >>>> implementation
> >>>>>>>> of LZ4 compression in Parquet files
> >>>>>>>>
> >>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1241
> >>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1878
> >>>>>>>
> >>>>>>> I don't have any particular opinion on how to solve the LZ4 issue,
> >>>> but
> >>>>>>> I'd like to mention that LZ4 and ZStandard are the two most
> efficient
> >>>>>>> compression algorithms available, and they span different parts of
> >>>> the
> >>>>>>> speed/compression spectrum, so it would be a pity to disable one of
> >>>> them.
> >>>>>>
> >>>>>> It's true, however I think it's worse to write LZ4-compressed files
> >>>>>> that cannot be read by other Parquet implementations (if that's
> what's
> >>>>>> happening as I understand it?). If we are indeed shipping something
> >>>>>> broken then we either should fix it or disable it until it can be
> >>>>>> fixed.
> >>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> Antoine.
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
>

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Krisztián Szűcs <sz...@gmail.com>.
On Mon, Jul 13, 2020 at 11:15 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> I'm not sure that's a good idea.  There are probably Parquet files that
> are only ever used with the Arrow implementation (Arrow C++, Arrow
> Python, Arrow R...).

I tend to agree with Antoine here. As an alternative to disabling the
compression entirely we could explicitly raise a warning and document
the problem.
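
One possible shape for such a warning, as a sketch only: the helper `warn_if_lz4` and the message text are assumptions for illustration, not an existing Arrow API.

```python
import warnings


def warn_if_lz4(codec: str) -> str:
    """Return the normalized codec name, warning when LZ4 is requested.

    Hypothetical helper: a writer would call this before compressing pages.
    """
    name = codec.upper()
    if name == "LZ4":
        warnings.warn(
            "LZ4-compressed Parquet files written here may not be readable "
            "by other Parquet implementations",
            UserWarning,
            stacklevel=2,  # point the warning at the caller, not this helper
        )
    return name
```

Unlike disabling the codec, this keeps LZ4 available for systems that only need to interoperate with themselves.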

>
> I admit I'm also not terribly bothered about this, since the Parquet
> community itself doesn't seem to care much about the issue (it has been
> known for a long time and they could have solved it long ago).
>
> Regards
>
> Antoine.
>
>
> On 13 Jul 2020 at 00:11, Wes McKinney wrote:
> > Since there hasn't been other movement on this, we need to disable
> > writing LZ4-compressed files until this can be investigated more
> > thoroughly. If someone wants to submit a patch that would be helpful
> > otherwise I can take a look in the next couple days
> >
> > On Thu, Jul 2, 2020 at 12:50 PM Antoine Pitrou <an...@python.org> wrote:
> >>
> >>
> >> Well, it depends how important speed is, but LZ4 has extremely fast
> >> decompression, even compared to Snappy:
> >> https://github.com/lz4/lz4#benchmarks
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> On 2 Jul 2020 at 19:47, Christian Hudon wrote:
> >>> At least for us, the advantages of Parquet are speed and interoperability
> >>> in the context of longer-term data storage, so I would tend to say
> >>> "reasonably conservative".
> >>>
> >>> On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou <so...@pitrou.net>
> >>> wrote:
> >>>
> >>>>
> >>>> I don't have a sense of how conservative Parquet users generally are.
> >>>> Is it worth adding a LZ4_FRAMED compression option in the Parquet
> >>>> format, or would people just not use it?
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>>
> >>>> On Tue, 30 Jun 2020 14:33:17 +0200
> >>>> "Uwe L. Korn" <uw...@xhochy.com> wrote:
> >>>>> I'm also in favor of disabling support for now. Having to deal with
> >>>> broken files or the detection of various incompatible implementations in
> >>>> the long-term will harm more than not supporting LZ4 for a while. Snappy is
> >>>> generally more used than LZ4 in this category as it has been available
> >>>> since the inception of Parquet and thus should be considered as a viable
> >>>> alternative.
> >>>>>
> >>>>> Cheers
> >>>>> Uwe
> >>>>>
> >>>>> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> >>>>>> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org>
> >>>> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On 25 Jun 2020 at 00:02, Wes McKinney wrote:
> >>>>>>>> hi folks,
> >>>>>>>>
> >>>>>>>> (cross-posting to dev@arrow and dev@parquet since there are
> >>>>>>>> stakeholders in both places)
> >>>>>>>>
> >>>>>>>> It seems there are still problems at least with the C++
> >>>> implementation
> >>>>>>>> of LZ4 compression in Parquet files
> >>>>>>>>
> >>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1241
> >>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1878
> >>>>>>>
> >>>>>>> I don't have any particular opinion on how to solve the LZ4 issue,
> >>>> but
> >>>>>>> I'd like to mention that LZ4 and ZStandard are the two most efficient
> >>>>>>> compression algorithms available, and they span different parts of
> >>>> the
> >>>>>>> speed/compression spectrum, so it would be a pity to disable one of
> >>>> them.
> >>>>>>
> >>>>>> It's true, however I think it's worse to write LZ4-compressed files
> >>>>>> that cannot be read by other Parquet implementations (if that's what's
> >>>>>> happening as I understand it?). If we are indeed shipping something
> >>>>>> broken then we either should fix it or disable it until it can be
> >>>>>> fixed.
> >>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> Antoine.
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Antoine Pitrou <an...@python.org>.
I'm not sure that's a good idea.  There are probably Parquet files that
are only ever used with the Arrow implementation (Arrow C++, Arrow
Python, Arrow R...).

I admit I'm also not terribly bothered about this, since the Parquet
community itself doesn't seem to care much about the issue (it has been
known for a long time and they could have solved it long ago).

Regards

Antoine.


On 13 Jul 2020 at 00:11, Wes McKinney wrote:
> Since there hasn't been other movement on this, we need to disable
> writing LZ4-compressed files until this can be investigated more
> thoroughly. If someone wants to submit a patch that would be helpful
> otherwise I can take a look in the next couple days
> 
> On Thu, Jul 2, 2020 at 12:50 PM Antoine Pitrou <an...@python.org> wrote:
>>
>>
>> Well, it depends how important speed is, but LZ4 has extremely fast
>> decompression, even compared to Snappy:
>> https://github.com/lz4/lz4#benchmarks
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 2 Jul 2020 at 19:47, Christian Hudon wrote:
>>> At least for us, the advantages of Parquet are speed and interoperability
>>> in the context of longer-term data storage, so I would tend to say
>>> "reasonably conservative".
>>>
>>> On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou <so...@pitrou.net>
>>> wrote:
>>>
>>>>
>>>> I don't have a sense of how conservative Parquet users generally are.
>>>> Is it worth adding a LZ4_FRAMED compression option in the Parquet
>>>> format, or would people just not use it?
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>> On Tue, 30 Jun 2020 14:33:17 +0200
>>>> "Uwe L. Korn" <uw...@xhochy.com> wrote:
>>>>> I'm also in favor of disabling support for now. Having to deal with
>>>> broken files or the detection of various incompatible implementations in
>>>> the long-term will harm more than not supporting LZ4 for a while. Snappy is
>>>> generally more used than LZ4 in this category as it has been available
>>>> since the inception of Parquet and thus should be considered as a viable
>>>> alternative.
>>>>>
>>>>> Cheers
>>>>> Uwe
>>>>>
>>>>> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
>>>>>> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org>
>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 25 Jun 2020 at 00:02, Wes McKinney wrote:
>>>>>>>> hi folks,
>>>>>>>>
>>>>>>>> (cross-posting to dev@arrow and dev@parquet since there are
>>>>>>>> stakeholders in both places)
>>>>>>>>
>>>>>>>> It seems there are still problems at least with the C++
>>>> implementation
>>>>>>>> of LZ4 compression in Parquet files
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1241
>>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1878
>>>>>>>
>>>>>>> I don't have any particular opinion on how to solve the LZ4 issue,
>>>> but
>>>>>>> I'd like to mention that LZ4 and ZStandard are the two most efficient
>>>>>>> compression algorithms available, and they span different parts of
>>>> the
>>>>>>> speed/compression spectrum, so it would be a pity to disable one of
>>>> them.
>>>>>>
>>>>>> It's true, however I think it's worse to write LZ4-compressed files
>>>>>> that cannot be read by other Parquet implementations (if that's what's
>>>>>> happening as I understand it?). If we are indeed shipping something
>>>>>> broken then we either should fix it or disable it until it can be
>>>>>> fixed.
>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Antoine.
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Wes McKinney <we...@gmail.com>.
Since there hasn't been other movement on this, we need to disable
writing LZ4-compressed files until this can be investigated more
thoroughly. If someone wants to submit a patch, that would be helpful;
otherwise I can take a look in the next couple of days.

On Thu, Jul 2, 2020 at 12:50 PM Antoine Pitrou <an...@python.org> wrote:
>
>
> Well, it depends how important speed is, but LZ4 has extremely fast
> decompression, even compared to Snappy:
> https://github.com/lz4/lz4#benchmarks
>
> Regards
>
> Antoine.
>
>
> On 2 Jul 2020 at 19:47, Christian Hudon wrote:
> > At least for us, the advantages of Parquet are speed and interoperability
> > in the context of longer-term data storage, so I would tend to say
> > "reasonably conservative".
> >
> > On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou <so...@pitrou.net>
> > wrote:
> >
> >>
> >> I don't have a sense of how conservative Parquet users generally are.
> >> Is it worth adding a LZ4_FRAMED compression option in the Parquet
> >> format, or would people just not use it?
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> On Tue, 30 Jun 2020 14:33:17 +0200
> >> "Uwe L. Korn" <uw...@xhochy.com> wrote:
> >>> I'm also in favor of disabling support for now. Having to deal with
> >> broken files or the detection of various incompatible implementations in
> >> the long-term will harm more than not supporting LZ4 for a while. Snappy is
> >> generally more used than LZ4 in this category as it has been available
> >> since the inception of Parquet and thus should be considered as a viable
> >> alternative.
> >>>
> >>> Cheers
> >>> Uwe
> >>>
> >>> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> >>>> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org>
> >> wrote:
> >>>>>
> >>>>>
> >>>>> On 25 Jun 2020 at 00:02, Wes McKinney wrote:
> >>>>>> hi folks,
> >>>>>>
> >>>>>> (cross-posting to dev@arrow and dev@parquet since there are
> >>>>>> stakeholders in both places)
> >>>>>>
> >>>>>> It seems there are still problems at least with the C++
> >> implementation
> >>>>>> of LZ4 compression in Parquet files
> >>>>>>
> >>>>>> https://issues.apache.org/jira/browse/PARQUET-1241
> >>>>>> https://issues.apache.org/jira/browse/PARQUET-1878
> >>>>>
> >>>>> I don't have any particular opinion on how to solve the LZ4 issue,
> >> but
> >>>>> I'd like to mention that LZ4 and ZStandard are the two most efficient
> >>>>> compression algorithms available, and they span different parts of
> >> the
> >>>>> speed/compression spectrum, so it would be a pity to disable one of
> >> them.
> >>>>
> >>>> It's true, however I think it's worse to write LZ4-compressed files
> >>>> that cannot be read by other Parquet implementations (if that's what's
> >>>> happening as I understand it?). If we are indeed shipping something
> >>>> broken then we either should fix it or disable it until it can be
> >>>> fixed.
> >>>>
> >>>>> Regards
> >>>>>
> >>>>> Antoine.
> >>>>
> >>>
> >>
> >>
> >>
> >>
> >

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Antoine Pitrou <an...@python.org>.
Well, it depends how important speed is, but LZ4 has extremely fast
decompression, even compared to Snappy:
https://github.com/lz4/lz4#benchmarks

Regards

Antoine.


On 2 Jul 2020 at 19:47, Christian Hudon wrote:
> At least for us, the advantages of Parquet are speed and interoperability
> in the context of longer-term data storage, so I would tend to say
> "reasonably conservative".
> 
> On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou <so...@pitrou.net>
> wrote:
> 
>>
>> I don't have a sense of how conservative Parquet users generally are.
>> Is it worth adding a LZ4_FRAMED compression option in the Parquet
>> format, or would people just not use it?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On Tue, 30 Jun 2020 14:33:17 +0200
>> "Uwe L. Korn" <uw...@xhochy.com> wrote:
>>> I'm also in favor of disabling support for now. Having to deal with
>> broken files or the detection of various incompatible implementations in
>> the long-term will harm more than not supporting LZ4 for a while. Snappy is
>> generally more used than LZ4 in this category as it has been available
>> since the inception of Parquet and thus should be considered as a viable
>> alternative.
>>>
>>> Cheers
>>> Uwe
>>>
>>> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
>>>> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org>
>> wrote:
>>>>>
>>>>>
>>>>> On 25 Jun 2020 at 00:02, Wes McKinney wrote:
>>>>>> hi folks,
>>>>>>
>>>>>> (cross-posting to dev@arrow and dev@parquet since there are
>>>>>> stakeholders in both places)
>>>>>>
>>>>>> It seems there are still problems at least with the C++
>> implementation
>>>>>> of LZ4 compression in Parquet files
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/PARQUET-1241
>>>>>> https://issues.apache.org/jira/browse/PARQUET-1878
>>>>>
>>>>> I don't have any particular opinion on how to solve the LZ4 issue,
>> but
>>>>> I'd like to mention that LZ4 and ZStandard are the two most efficient
>>>>> compression algorithms available, and they span different parts of
>> the
>>>>> speed/compression spectrum, so it would be a pity to disable one of
>> them.
>>>>
>>>> It's true, however I think it's worse to write LZ4-compressed files
>>>> that cannot be read by other Parquet implementations (if that's what's
>>>> happening as I understand it?). If we are indeed shipping something
>>>> broken then we either should fix it or disable it until it can be
>>>> fixed.
>>>>
>>>>> Regards
>>>>>
>>>>> Antoine.
>>>>
>>>
>>
>>
>>
>>
> 

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Christian Hudon <ch...@elementai.com>.
At least for us, the advantages of Parquet are speed and interoperability
in the context of longer-term data storage, so I would tend to say
"reasonably conservative".

On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou <so...@pitrou.net>
wrote:

>
> I don't have a sense of how conservative Parquet users generally are.
> Is it worth adding a LZ4_FRAMED compression option in the Parquet
> format, or would people just not use it?
>
> Regards
>
> Antoine.
>
>
> On Tue, 30 Jun 2020 14:33:17 +0200
> "Uwe L. Korn" <uw...@xhochy.com> wrote:
> > I'm also in favor of disabling support for now. Having to deal with
> broken files or the detection of various incompatible implementations in
> the long-term will harm more than not supporting LZ4 for a while. Snappy is
> generally more used than LZ4 in this category as it has been available
> since the inception of Parquet and thus should be considered as a viable
> alternative.
> >
> > Cheers
> > Uwe
> >
> > On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> > > On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org>
> wrote:
> > > >
> > > >
> > > > On 25 Jun 2020 at 00:02, Wes McKinney wrote:
> > > > > hi folks,
> > > > >
> > > > > (cross-posting to dev@arrow and dev@parquet since there are
> > > > > stakeholders in both places)
> > > > >
> > > > > It seems there are still problems at least with the C++
> implementation
> > > > > of LZ4 compression in Parquet files
> > > > >
> > > > > https://issues.apache.org/jira/browse/PARQUET-1241
> > > > > https://issues.apache.org/jira/browse/PARQUET-1878
> > > >
> > > > I don't have any particular opinion on how to solve the LZ4 issue,
> but
> > > > I'd like to mention that LZ4 and ZStandard are the two most efficient
> > > > compression algorithms available, and they span different parts of
> the
> > > > speed/compression spectrum, so it would be a pity to disable one of
> them.
> > >
> > > It's true, however I think it's worse to write LZ4-compressed files
> > > that cannot be read by other Parquet implementations (if that's what's
> > > happening as I understand it?). If we are indeed shipping something
> > > broken then we either should fix it or disable it until it can be
> > > fixed.
> > >
> > > > Regards
> > > >
> > > > Antoine.
> > >
> >
>
>
>
>

-- 
Christian Hudon
Applied Research Scientist
Element AI, 6650 Saint-Urbain #500
Montréal, QC, H2S 3G9, Canada
Elementai.com

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Antoine Pitrou <so...@pitrou.net>.
I don't have a sense of how conservative Parquet users generally are.
Is it worth adding a LZ4_FRAMED compression option in the Parquet
format, or would people just not use it?
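
For context on what "framed" would mean here: the LZ4 frame format starts with the 4-byte magic number 0x184D2204 (stored little-endian), while a raw LZ4 block carries no such header. A minimal sketch of a reader-side check follows, where `looks_like_lz4_frame` is a hypothetical helper and the check is only a heuristic, since raw block data could happen to begin with the same bytes.

```python
import struct

LZ4_FRAME_MAGIC = 0x184D2204  # magic number of the LZ4 frame format


def looks_like_lz4_frame(buf: bytes) -> bool:
    """Heuristically report whether buf begins with an LZ4 frame header.

    Hypothetical helper: it only inspects the first four bytes and does not
    validate the rest of the frame.
    """
    if len(buf) < 4:
        return False
    (magic,) = struct.unpack_from("<I", buf, 0)  # little-endian uint32
    return magic == LZ4_FRAME_MAGIC
```

A separate codec name in the format would make such guessing unnecessary, because the file metadata would state which layout was used.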

Regards

Antoine.


On Tue, 30 Jun 2020 14:33:17 +0200
"Uwe L. Korn" <uw...@xhochy.com> wrote:
> I'm also in favor of disabling support for now. Having to deal with broken files or the detection of various incompatible implementations in the long-term will harm more than not supporting LZ4 for a while. Snappy is generally more used than LZ4 in this category as it has been available since the inception of Parquet and thus should be considered as a viable alternative.
> 
> Cheers
> Uwe
> 
> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> > On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org> wrote:  
> > >
> > >
> > > > On 25 Jun 2020 at 00:02, Wes McKinney wrote:
> > > > hi folks,
> > > >
> > > > (cross-posting to dev@arrow and dev@parquet since there are
> > > > stakeholders in both places)
> > > >
> > > > It seems there are still problems at least with the C++ implementation
> > > > of LZ4 compression in Parquet files
> > > >
> > > > https://issues.apache.org/jira/browse/PARQUET-1241
> > > > https://issues.apache.org/jira/browse/PARQUET-1878  
> > >
> > > I don't have any particular opinion on how to solve the LZ4 issue, but
> > > I'd like to mention that LZ4 and ZStandard are the two most efficient
> > > compression algorithms available, and they span different parts of the
> > > speed/compression spectrum, so it would be a pity to disable one of them.  
> > 
> > It's true, however I think it's worse to write LZ4-compressed files
> > that cannot be read by other Parquet implementations (if that's what's
> > happening as I understand it?). If we are indeed shipping something
> > broken then we either should fix it or disable it until it can be
> > fixed.
> >   
> > > Regards
> > >
> > > Antoine.  
> >  
> 




Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
I'm also in favor of disabling support for now. Having to deal with broken files or the detection of various incompatible implementations in the long term will do more harm than not supporting LZ4 for a while. Snappy is generally more widely used than LZ4 in this category, as it has been available since the inception of Parquet, and thus should be considered a viable alternative.

Cheers
Uwe

On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > On 25/06/2020 at 00:02, Wes McKinney wrote:
> > > hi folks,
> > >
> > > (cross-posting to dev@arrow and dev@parquet since there are
> > > stakeholders in both places)
> > >
> > > It seems there are still problems at least with the C++ implementation
> > > of LZ4 compression in Parquet files
> > >
> > > https://issues.apache.org/jira/browse/PARQUET-1241
> > > https://issues.apache.org/jira/browse/PARQUET-1878
> >
> > I don't have any particular opinion on how to solve the LZ4 issue, but
> > I'd like to mention that LZ4 and ZStandard are the two most efficient
> > compression algorithms available, and they span different parts of the
> > speed/compression spectrum, so it would be a pity to disable one of them.
> 
> It's true; however, I think it's worse to write LZ4-compressed files
> that cannot be read by other Parquet implementations (if that's what's
> happening as I understand it?). If we are indeed shipping something
> broken then we either should fix it or disable it until it can be
> fixed.
> 
> > Regards
> >
> > Antoine.
>

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Wes McKinney <we...@gmail.com>.
On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> On 25/06/2020 at 00:02, Wes McKinney wrote:
> > hi folks,
> >
> > (cross-posting to dev@arrow and dev@parquet since there are
> > stakeholders in both places)
> >
> > It seems there are still problems at least with the C++ implementation
> > of LZ4 compression in Parquet files
> >
> > https://issues.apache.org/jira/browse/PARQUET-1241
> > https://issues.apache.org/jira/browse/PARQUET-1878
>
> I don't have any particular opinion on how to solve the LZ4 issue, but
> I'd like to mention that LZ4 and ZStandard are the two most efficient
> compression algorithms available, and they span different parts of the
> speed/compression spectrum, so it would be a pity to disable one of them.

It's true; however, I think it's worse to write LZ4-compressed files
that cannot be read by other Parquet implementations (if that's what's
happening as I understand it?). If we are indeed shipping something
broken then we either should fix it or disable it until it can be
fixed.

> Regards
>
> Antoine.

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Antoine Pitrou <an...@python.org>.
On 25/06/2020 at 00:02, Wes McKinney wrote:
> hi folks,
> 
> (cross-posting to dev@arrow and dev@parquet since there are
> stakeholders in both places)
> 
> It seems there are still problems at least with the C++ implementation
> of LZ4 compression in Parquet files
> 
> https://issues.apache.org/jira/browse/PARQUET-1241
> https://issues.apache.org/jira/browse/PARQUET-1878

I don't have any particular opinion on how to solve the LZ4 issue, but
I'd like to mention that LZ4 and ZStandard are the two most efficient
compression algorithms available, and they span different parts of the
speed/compression spectrum, so it would be a pity to disable one of them.

Regards

Antoine.
