Posted to dev@parquet.apache.org by Antoine Pitrou <so...@pitrou.net> on 2020/07/01 13:31:55 UTC

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

I don't have a sense of how conservative Parquet users generally are.
Is it worth adding a LZ4_FRAMED compression option in the Parquet
format, or would people just not use it?

Regards

Antoine.
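
As background to the LZ4_FRAMED question: the LZ4 frame format starts with a 4-byte magic number (0x184D2204, stored little-endian), while a raw LZ4 block has no header at all. Below is a minimal Python sketch of a heuristic check for that magic number. The function name is hypothetical and not part of any Parquet or Arrow API. A raw block could start with the same bytes by coincidence, so this is a hint rather than proof.

```python
import struct

# Magic number defined by the LZ4 frame format; written little-endian on disk.
LZ4_FRAME_MAGIC = 0x184D2204


def looks_like_lz4_frame(buf: bytes) -> bool:
    """Heuristic: True if buf begins with the LZ4 frame magic number.

    Hypothetical helper for illustration only. A raw LZ4 block has no
    header, so a False result does not prove the data is a raw block.
    """
    if len(buf) < 4:
        return False
    (magic,) = struct.unpack_from("<I", buf, 0)
    return magic == LZ4_FRAME_MAGIC
```

A reader could use such a check only as a first guess about which LZ4 variant produced a page; it cannot tell different unframed variants apart.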


On Tue, 30 Jun 2020 14:33:17 +0200
"Uwe L. Korn" <uw...@xhochy.com> wrote:
> I'm also in favor of disabling support for now. Having to deal with broken files or the detection of various incompatible implementations in the long-term will harm more than not supporting LZ4 for a while. Snappy is generally more used than LZ4 in this category as it has been available since the inception of Parquet and thus should be considered as a viable alternative.
> 
> Cheers
> Uwe
> 
> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> > On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org> wrote:  
> > >
> > >
> > > > On 25/06/2020 at 00:02, Wes McKinney wrote:
> > > > hi folks,
> > > >
> > > > (cross-posting to dev@arrow and dev@parquet since there are
> > > > stakeholders in both places)
> > > >
> > > > It seems there are still problems at least with the C++ implementation
> > > > of LZ4 compression in Parquet files
> > > >
> > > > https://issues.apache.org/jira/browse/PARQUET-1241
> > > > https://issues.apache.org/jira/browse/PARQUET-1878  
> > >
> > > I don't have any particular opinion on how to solve the LZ4 issue, but
> > > I'd like to mention that LZ4 and ZStandard are the two most efficient
> > > compression algorithms available, and they span different parts of the
> > > speed/compression spectrum, so it would be a pity to disable one of them.  
> > 
> > It's true, however I think it's worse to write LZ4-compressed files
> > that cannot be read by other Parquet implementations (if that's what's
> > happening as I understand it?). If we are indeed shipping something
> > broken then we either should fix it or disable it until it can be
> > fixed.
> >   
> > > Regards
> > >
> > > Antoine.  
> >  
> 




Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Patrick Pai <pa...@gmail.com>.
I'll volunteer to disable writing/reading LZ4. I'll submit a patch in the next few days.

On 2020/07/12 22:11:33, Wes McKinney <we...@gmail.com> wrote: 
> Since there hasn't been other movement on this, we need to disable
> writing LZ4-compressed files until this can be investigated more
> thoroughly. If someone wants to submit a patch that would be helpful
> otherwise I can take a look in the next couple days
> 
> On Thu, Jul 2, 2020 at 12:50 PM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Well, it depends how important speed is, but LZ4 has extremely fast
> > decompression, even compared to Snappy:
> > https://github.com/lz4/lz4#benchmarks
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 02/07/2020 at 19:47, Christian Hudon wrote:
> > > At least for us, the advantages of Parquet are speed and interoperability
> > > in the context of longer-term data storage, so I would tend to say
> > > "reasonably conservative".
> > >
> > > On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou <so...@pitrou.net> wrote:
> > >
> > >>
> > >> I don't have a sense of how conservative Parquet users generally are.
> > >> Is it worth adding a LZ4_FRAMED compression option in the Parquet
> > >> format, or would people just not use it?
> > >>
> > >> Regards
> > >>
> > >> Antoine.
> > >>
> > >>
> > >> On Tue, 30 Jun 2020 14:33:17 +0200
> > >> "Uwe L. Korn" <uw...@xhochy.com> wrote:
> > >>> I'm also in favor of disabling support for now. Having to deal with
> > >> broken files or the detection of various incompatible implementations in
> > >> the long-term will harm more than not supporting LZ4 for a while. Snappy is
> > >> generally more used than LZ4 in this category as it has been available
> > >> since the inception of Parquet and thus should be considered as a viable
> > >> alternative.
> > >>>
> > >>> Cheers
> > >>> Uwe
> > >>>
> > >>> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> > >>>> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org> wrote:
> > >>>>>
> > >>>>>
> > >>>>> On 25/06/2020 at 00:02, Wes McKinney wrote:
> > >>>>>> hi folks,
> > >>>>>>
> > >>>>>> (cross-posting to dev@arrow and dev@parquet since there are
> > >>>>>> stakeholders in both places)
> > >>>>>>
> > >>>>>> It seems there are still problems at least with the C++
> > >> implementation
> > >>>>>> of LZ4 compression in Parquet files
> > >>>>>>
> > >>>>>> https://issues.apache.org/jira/browse/PARQUET-1241
> > >>>>>> https://issues.apache.org/jira/browse/PARQUET-1878
> > >>>>>
> > >>>>> I don't have any particular opinion on how to solve the LZ4 issue,
> > >> but
> > >>>>> I'd like to mention that LZ4 and ZStandard are the two most efficient
> > >>>>> compression algorithms available, and they span different parts of
> > >> the
> > >>>>> speed/compression spectrum, so it would be a pity to disable one of
> > >> them.
> > >>>>
> > >>>> It's true, however I think it's worse to write LZ4-compressed files
> > >>>> that cannot be read by other Parquet implementations (if that's what's
> > >>>> happening as I understand it?). If we are indeed shipping something
> > >>>> broken then we either should fix it or disable it until it can be
> > >>>> fixed.
> > >>>>
> > >>>>> Regards
> > >>>>>
> > >>>>> Antoine.
> > >>>>
> > >>>
> > >>
> > >>
> > >>
> > >>
> > >
> 

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Antoine Pitrou <an...@python.org>.
Agreed, but even then, if some Parquet files are generated inside of a
well-defined system which only needs to be interoperable with itself,
it's not necessarily harmful to allow LZ4 compression when writing new files.

Regards

Antoine.


On 13/07/2020 at 17:07, Wes McKinney wrote:
> I didn’t say to disable _reading_ them, only writing them.
> 
> On Mon, Jul 13, 2020 at 4:15 AM Antoine Pitrou <an...@python.org> wrote:
> 
>>
>> I'm not sure that's a good idea.  There are probably Parquet files that
>> are only ever used with the Arrow implementation (Arrow C++, Arrow
>> Python, Arrow R...).
>>
>> I admit I'm also not terribly bothered about this, since the Parquet
>> community itself doesn't seem to care much about the issue (it has been
>> known for a long time and they could have solved it long ago).
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 13/07/2020 at 00:11, Wes McKinney wrote:
>>> Since there hasn't been other movement on this, we need to disable
>>> writing LZ4-compressed files until this can be investigated more
>>> thoroughly. If someone wants to submit a patch that would be helpful
>>> otherwise I can take a look in the next couple days
>>>
>>> On Thu, Jul 2, 2020 at 12:50 PM Antoine Pitrou <an...@python.org> wrote:
>>>>
>>>>
>>>> Well, it depends how important speed is, but LZ4 has extremely fast
>>>> decompression, even compared to Snappy:
>>>> https://github.com/lz4/lz4#benchmarks
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>>> On 02/07/2020 at 19:47, Christian Hudon wrote:
>>>>> At least for us, the advantages of Parquet are speed and
>> interoperability
>>>>> in the context of longer-term data storage, so I would tend to say
>>>>> "reasonably conservative".
>>>>>
>>>>> On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou <so...@pitrou.net> wrote:
>>>>>
>>>>>>
>>>>>> I don't have a sense of how conservative Parquet users generally are.
>>>>>> Is it worth adding a LZ4_FRAMED compression option in the Parquet
>>>>>> format, or would people just not use it?
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Antoine.
>>>>>>
>>>>>>
>>>>>> On Tue, 30 Jun 2020 14:33:17 +0200
>>>>>> "Uwe L. Korn" <uw...@xhochy.com> wrote:
>>>>>>> I'm also in favor of disabling support for now. Having to deal with
>>>>>> broken files or the detection of various incompatible implementations
>> in
>>>>>> the long-term will harm more than not supporting LZ4 for a while.
>> Snappy is
>>>>>> generally more used than LZ4 in this category as it has been available
>>>>>> since the inception of Parquet and thus should be considered as a
>> viable
>>>>>> alternative.
>>>>>>>
>>>>>>> Cheers
>>>>>>> Uwe
>>>>>>>
>>>>>>> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
>>>>>>>> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 25/06/2020 at 00:02, Wes McKinney wrote:
>>>>>>>>>> hi folks,
>>>>>>>>>>
>>>>>>>>>> (cross-posting to dev@arrow and dev@parquet since there are
>>>>>>>>>> stakeholders in both places)
>>>>>>>>>>
>>>>>>>>>> It seems there are still problems at least with the C++
>>>>>> implementation
>>>>>>>>>> of LZ4 compression in Parquet files
>>>>>>>>>>
>>>>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1241
>>>>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1878
>>>>>>>>>
>>>>>>>>> I don't have any particular opinion on how to solve the LZ4 issue,
>>>>>> but
>>>>>>>>> I'd like to mention that LZ4 and ZStandard are the two most
>> efficient
>>>>>>>>> compression algorithms available, and they span different parts of
>>>>>> the
>>>>>>>>> speed/compression spectrum, so it would be a pity to disable one of
>>>>>> them.
>>>>>>>>
>>>>>>>> It's true, however I think it's worse to write LZ4-compressed files
>>>>>>>> that cannot be read by other Parquet implementations (if that's
>> what's
>>>>>>>> happening as I understand it?). If we are indeed shipping something
>>>>>>>> broken then we either should fix it or disable it until it can be
>>>>>>>> fixed.
>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Antoine.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>
> 

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Wes McKinney <we...@gmail.com>.
I didn’t say to disable _reading_ them, only writing them.
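
To make the asymmetry concrete, here is a minimal Python sketch of a codec check that refuses LZ4 for writing but still accepts it for reading. All names here (WRITE_DISABLED, WriteDisabledError, check_codec) are hypothetical and do not correspond to the actual Arrow or Parquet C++ API; it only illustrates the policy.

```python
# Hypothetical policy table: codecs that may be read but not written.
WRITE_DISABLED = frozenset({"LZ4"})


class WriteDisabledError(ValueError):
    """Raised when a codec is requested for writing but is write-disabled."""


def check_codec(codec: str, *, for_writing: bool) -> str:
    """Return the normalized codec name, or raise if writing is disabled.

    Reading is always allowed, so existing LZ4 files stay readable.
    """
    name = codec.upper()
    if for_writing and name in WRITE_DISABLED:
        raise WriteDisabledError(
            f"writing {name}-compressed Parquet files is disabled "
            "until the interoperability issues are resolved"
        )
    return name
```

With this shape, check_codec("lz4", for_writing=False) succeeds while check_codec("lz4", for_writing=True) raises.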

On Mon, Jul 13, 2020 at 4:15 AM Antoine Pitrou <an...@python.org> wrote:

>
> I'm not sure that's a good idea.  There are probably Parquet files that
> are only ever used with the Arrow implementation (Arrow C++, Arrow
> Python, Arrow R...).
>
> I admit I'm also not terribly bothered about this, since the Parquet
> community itself doesn't seem to care much about the issue (it has been
> known for a long time and they could have solved it long ago).
>
> Regards
>
> Antoine.
>
>
> On 13/07/2020 at 00:11, Wes McKinney wrote:
> > Since there hasn't been other movement on this, we need to disable
> > writing LZ4-compressed files until this can be investigated more
> > thoroughly. If someone wants to submit a patch that would be helpful
> > otherwise I can take a look in the next couple days
> >
> > On Thu, Jul 2, 2020 at 12:50 PM Antoine Pitrou <an...@python.org> wrote:
> >>
> >>
> >> Well, it depends how important speed is, but LZ4 has extremely fast
> >> decompression, even compared to Snappy:
> >> https://github.com/lz4/lz4#benchmarks
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> On 02/07/2020 at 19:47, Christian Hudon wrote:
> >>> At least for us, the advantages of Parquet are speed and
> interoperability
> >>> in the context of longer-term data storage, so I would tend to say
> >>> "reasonably conservative".
> >>>
> >>> On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou <so...@pitrou.net> wrote:
> >>>
> >>>>
> >>>> I don't have a sense of how conservative Parquet users generally are.
> >>>> Is it worth adding a LZ4_FRAMED compression option in the Parquet
> >>>> format, or would people just not use it?
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>>
> >>>> On Tue, 30 Jun 2020 14:33:17 +0200
> >>>> "Uwe L. Korn" <uw...@xhochy.com> wrote:
> >>>>> I'm also in favor of disabling support for now. Having to deal with
> >>>> broken files or the detection of various incompatible implementations
> in
> >>>> the long-term will harm more than not supporting LZ4 for a while.
> Snappy is
> >>>> generally more used than LZ4 in this category as it has been available
> >>>> since the inception of Parquet and thus should be considered as a
> viable
> >>>> alternative.
> >>>>>
> >>>>> Cheers
> >>>>> Uwe
> >>>>>
> >>>>> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> >>>>>> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On 25/06/2020 at 00:02, Wes McKinney wrote:
> >>>>>>>> hi folks,
> >>>>>>>>
> >>>>>>>> (cross-posting to dev@arrow and dev@parquet since there are
> >>>>>>>> stakeholders in both places)
> >>>>>>>>
> >>>>>>>> It seems there are still problems at least with the C++
> >>>> implementation
> >>>>>>>> of LZ4 compression in Parquet files
> >>>>>>>>
> >>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1241
> >>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1878
> >>>>>>>
> >>>>>>> I don't have any particular opinion on how to solve the LZ4 issue,
> >>>> but
> >>>>>>> I'd like to mention that LZ4 and ZStandard are the two most
> efficient
> >>>>>>> compression algorithms available, and they span different parts of
> >>>> the
> >>>>>>> speed/compression spectrum, so it would be a pity to disable one of
> >>>> them.
> >>>>>>
> >>>>>> It's true, however I think it's worse to write LZ4-compressed files
> >>>>>> that cannot be read by other Parquet implementations (if that's
> what's
> >>>>>> happening as I understand it?). If we are indeed shipping something
> >>>>>> broken then we either should fix it or disable it until it can be
> >>>>>> fixed.
> >>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> Antoine.
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
>

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Krisztián Szűcs <sz...@gmail.com>.
On Mon, Jul 13, 2020 at 11:15 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> I'm not sure that's a good idea.  There are probably Parquet files that
> are only ever used with the Arrow implementation (Arrow C++, Arrow
> Python, Arrow R...).

I tend to agree with Antoine here. As an alternative to disabling the
compression entirely we could explicitly raise a warning and document
the problem.
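
A minimal Python sketch of what such a warning could look like, assuming a hypothetical hook called when a writer is configured. The names Lz4InteropWarning and warn_if_lz4 are illustrative only and not an existing Arrow API.

```python
import warnings


class Lz4InteropWarning(UserWarning):
    """Hypothetical warning category for LZ4 interoperability problems."""


def warn_if_lz4(codec: str) -> None:
    """Emit a warning, instead of failing, when LZ4 is chosen for writing."""
    if codec.upper() == "LZ4":
        warnings.warn(
            "LZ4-compressed Parquet files may not be readable by other "
            "Parquet implementations; see PARQUET-1241 and PARQUET-1878",
            Lz4InteropWarning,
            stacklevel=2,
        )
```

A dedicated warning category lets users who only exchange files within one implementation silence it with the standard warnings filters.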

>
> I admit I'm also not terribly bothered about this, since the Parquet
> community itself doesn't seem to care much about the issue (it has been
> known for a long time and they could have solved it long ago).
>
> Regards
>
> Antoine.
>
>
> On 13/07/2020 at 00:11, Wes McKinney wrote:
> > Since there hasn't been other movement on this, we need to disable
> > writing LZ4-compressed files until this can be investigated more
> > thoroughly. If someone wants to submit a patch that would be helpful
> > otherwise I can take a look in the next couple days
> >
> > On Thu, Jul 2, 2020 at 12:50 PM Antoine Pitrou <an...@python.org> wrote:
> >>
> >>
> >> Well, it depends how important speed is, but LZ4 has extremely fast
> >> decompression, even compared to Snappy:
> >> https://github.com/lz4/lz4#benchmarks
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> On 02/07/2020 at 19:47, Christian Hudon wrote:
> >>> At least for us, the advantages of Parquet are speed and interoperability
> >>> in the context of longer-term data storage, so I would tend to say
> >>> "reasonably conservative".
> >>>
> >>> On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou <so...@pitrou.net> wrote:
> >>>
> >>>>
> >>>> I don't have a sense of how conservative Parquet users generally are.
> >>>> Is it worth adding a LZ4_FRAMED compression option in the Parquet
> >>>> format, or would people just not use it?
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>>
> >>>> On Tue, 30 Jun 2020 14:33:17 +0200
> >>>> "Uwe L. Korn" <uw...@xhochy.com> wrote:
> >>>>> I'm also in favor of disabling support for now. Having to deal with
> >>>> broken files or the detection of various incompatible implementations in
> >>>> the long-term will harm more than not supporting LZ4 for a while. Snappy is
> >>>> generally more used than LZ4 in this category as it has been available
> >>>> since the inception of Parquet and thus should be considered as a viable
> >>>> alternative.
> >>>>>
> >>>>> Cheers
> >>>>> Uwe
> >>>>>
> >>>>> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> >>>>>> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On 25/06/2020 at 00:02, Wes McKinney wrote:
> >>>>>>>> hi folks,
> >>>>>>>>
> >>>>>>>> (cross-posting to dev@arrow and dev@parquet since there are
> >>>>>>>> stakeholders in both places)
> >>>>>>>>
> >>>>>>>> It seems there are still problems at least with the C++
> >>>> implementation
> >>>>>>>> of LZ4 compression in Parquet files
> >>>>>>>>
> >>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1241
> >>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1878
> >>>>>>>
> >>>>>>> I don't have any particular opinion on how to solve the LZ4 issue,
> >>>> but
> >>>>>>> I'd like to mention that LZ4 and ZStandard are the two most efficient
> >>>>>>> compression algorithms available, and they span different parts of
> >>>> the
> >>>>>>> speed/compression spectrum, so it would be a pity to disable one of
> >>>> them.
> >>>>>>
> >>>>>> It's true, however I think it's worse to write LZ4-compressed files
> >>>>>> that cannot be read by other Parquet implementations (if that's what's
> >>>>>> happening as I understand it?). If we are indeed shipping something
> >>>>>> broken then we either should fix it or disable it until it can be
> >>>>>> fixed.
> >>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> Antoine.
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Antoine Pitrou <an...@python.org>.
I'm not sure that's a good idea.  There are probably Parquet files that
are only ever used with the Arrow implementation (Arrow C++, Arrow
Python, Arrow R...).

I admit I'm also not terribly bothered about this, since the Parquet
community itself doesn't seem to care much about the issue (it has been
known for a long time and they could have solved it long ago).

Regards

Antoine.


On 13/07/2020 at 00:11, Wes McKinney wrote:
> Since there hasn't been other movement on this, we need to disable
> writing LZ4-compressed files until this can be investigated more
> thoroughly. If someone wants to submit a patch that would be helpful
> otherwise I can take a look in the next couple days
> 
> On Thu, Jul 2, 2020 at 12:50 PM Antoine Pitrou <an...@python.org> wrote:
>>
>>
>> Well, it depends how important speed is, but LZ4 has extremely fast
>> decompression, even compared to Snappy:
>> https://github.com/lz4/lz4#benchmarks
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 02/07/2020 at 19:47, Christian Hudon wrote:
>>> At least for us, the advantages of Parquet are speed and interoperability
>>> in the context of longer-term data storage, so I would tend to say
>>> "reasonably conservative".
>>>
>>> On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou <so...@pitrou.net> wrote:
>>>
>>>>
>>>> I don't have a sense of how conservative Parquet users generally are.
>>>> Is it worth adding a LZ4_FRAMED compression option in the Parquet
>>>> format, or would people just not use it?
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>> On Tue, 30 Jun 2020 14:33:17 +0200
>>>> "Uwe L. Korn" <uw...@xhochy.com> wrote:
>>>>> I'm also in favor of disabling support for now. Having to deal with
>>>> broken files or the detection of various incompatible implementations in
>>>> the long-term will harm more than not supporting LZ4 for a while. Snappy is
>>>> generally more used than LZ4 in this category as it has been available
>>>> since the inception of Parquet and thus should be considered as a viable
>>>> alternative.
>>>>>
>>>>> Cheers
>>>>> Uwe
>>>>>
>>>>> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
>>>>>> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 25/06/2020 at 00:02, Wes McKinney wrote:
>>>>>>>> hi folks,
>>>>>>>>
>>>>>>>> (cross-posting to dev@arrow and dev@parquet since there are
>>>>>>>> stakeholders in both places)
>>>>>>>>
>>>>>>>> It seems there are still problems at least with the C++
>>>> implementation
>>>>>>>> of LZ4 compression in Parquet files
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1241
>>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1878
>>>>>>>
>>>>>>> I don't have any particular opinion on how to solve the LZ4 issue,
>>>> but
>>>>>>> I'd like to mention that LZ4 and ZStandard are the two most efficient
>>>>>>> compression algorithms available, and they span different parts of
>>>> the
>>>>>>> speed/compression spectrum, so it would be a pity to disable one of
>>>> them.
>>>>>>
>>>>>> It's true, however I think it's worse to write LZ4-compressed files
>>>>>> that cannot be read by other Parquet implementations (if that's what's
>>>>>> happening as I understand it?). If we are indeed shipping something
>>>>>> broken then we either should fix it or disable it until it can be
>>>>>> fixed.
>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Antoine.
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Wes McKinney <we...@gmail.com>.
Since there hasn't been other movement on this, we need to disable
writing LZ4-compressed files until this can be investigated more
thoroughly. If someone wants to submit a patch, that would be helpful;
otherwise I can take a look in the next couple of days.

On Thu, Jul 2, 2020 at 12:50 PM Antoine Pitrou <an...@python.org> wrote:
>
>
> Well, it depends how important speed is, but LZ4 has extremely fast
> decompression, even compared to Snappy:
> https://github.com/lz4/lz4#benchmarks
>
> Regards
>
> Antoine.
>
>
> On 02/07/2020 at 19:47, Christian Hudon wrote:
> > At least for us, the advantages of Parquet are speed and interoperability
> > in the context of longer-term data storage, so I would tend to say
> > "reasonably conservative".
> >
> > On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou <so...@pitrou.net> wrote:
> >
> >>
> >> I don't have a sense of how conservative Parquet users generally are.
> >> Is it worth adding a LZ4_FRAMED compression option in the Parquet
> >> format, or would people just not use it?
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> On Tue, 30 Jun 2020 14:33:17 +0200
> >> "Uwe L. Korn" <uw...@xhochy.com> wrote:
> >>> I'm also in favor of disabling support for now. Having to deal with
> >> broken files or the detection of various incompatible implementations in
> >> the long-term will harm more than not supporting LZ4 for a while. Snappy is
> >> generally more used than LZ4 in this category as it has been available
> >> since the inception of Parquet and thus should be considered as a viable
> >> alternative.
> >>>
> >>> Cheers
> >>> Uwe
> >>>
> >>> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> >>>> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org> wrote:
> >>>>>
> >>>>>
> >>>>> On 25/06/2020 at 00:02, Wes McKinney wrote:
> >>>>>> hi folks,
> >>>>>>
> >>>>>> (cross-posting to dev@arrow and dev@parquet since there are
> >>>>>> stakeholders in both places)
> >>>>>>
> >>>>>> It seems there are still problems at least with the C++
> >> implementation
> >>>>>> of LZ4 compression in Parquet files
> >>>>>>
> >>>>>> https://issues.apache.org/jira/browse/PARQUET-1241
> >>>>>> https://issues.apache.org/jira/browse/PARQUET-1878
> >>>>>
> >>>>> I don't have any particular opinion on how to solve the LZ4 issue,
> >> but
> >>>>> I'd like to mention that LZ4 and ZStandard are the two most efficient
> >>>>> compression algorithms available, and they span different parts of
> >> the
> >>>>> speed/compression spectrum, so it would be a pity to disable one of
> >> them.
> >>>>
> >>>> It's true, however I think it's worse to write LZ4-compressed files
> >>>> that cannot be read by other Parquet implementations (if that's what's
> >>>> happening as I understand it?). If we are indeed shipping something
> >>>> broken then we either should fix it or disable it until it can be
> >>>> fixed.
> >>>>
> >>>>> Regards
> >>>>>
> >>>>> Antoine.
> >>>>
> >>>
> >>
> >>
> >>
> >>
> >

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Antoine Pitrou <an...@python.org>.
Well, it depends how important speed is, but LZ4 has extremely fast
decompression, even compared to Snappy:
https://github.com/lz4/lz4#benchmarks

Regards

Antoine.


On 02/07/2020 at 19:47, Christian Hudon wrote:
> At least for us, the advantages of Parquet are speed and interoperability
> in the context of longer-term data storage, so I would tend to say
> "reasonably conservative".
> 
> On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou <so...@pitrou.net> wrote:
> 
>>
>> I don't have a sense of how conservative Parquet users generally are.
>> Is it worth adding a LZ4_FRAMED compression option in the Parquet
>> format, or would people just not use it?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On Tue, 30 Jun 2020 14:33:17 +0200
>> "Uwe L. Korn" <uw...@xhochy.com> wrote:
>>> I'm also in favor of disabling support for now. Having to deal with
>> broken files or the detection of various incompatible implementations in
>> the long-term will harm more than not supporting LZ4 for a while. Snappy is
>> generally more used than LZ4 in this category as it has been available
>> since the inception of Parquet and thus should be considered as a viable
>> alternative.
>>>
>>> Cheers
>>> Uwe
>>>
>>> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
>>>> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org> wrote:
>>>>>
>>>>>
>>>>> On 25/06/2020 at 00:02, Wes McKinney wrote:
>>>>>> hi folks,
>>>>>>
>>>>>> (cross-posting to dev@arrow and dev@parquet since there are
>>>>>> stakeholders in both places)
>>>>>>
>>>>>> It seems there are still problems at least with the C++
>> implementation
>>>>>> of LZ4 compression in Parquet files
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/PARQUET-1241
>>>>>> https://issues.apache.org/jira/browse/PARQUET-1878
>>>>>
>>>>> I don't have any particular opinion on how to solve the LZ4 issue,
>> but
>>>>> I'd like to mention that LZ4 and ZStandard are the two most efficient
>>>>> compression algorithms available, and they span different parts of
>> the
>>>>> speed/compression spectrum, so it would be a pity to disable one of
>> them.
>>>>
>>>> It's true, however I think it's worse to write LZ4-compressed files
>>>> that cannot be read by other Parquet implementations (if that's what's
>>>> happening as I understand it?). If we are indeed shipping something
>>>> broken then we either should fix it or disable it until it can be
>>>> fixed.
>>>>
>>>>> Regards
>>>>>
>>>>> Antoine.
>>>>
>>>
>>
>>
>>
>>
> 

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Posted by Christian Hudon <ch...@elementai.com>.
At least for us, the advantages of Parquet are speed and interoperability
in the context of longer-term data storage, so I would tend to say
"reasonably conservative".

On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou <so...@pitrou.net> wrote:

>
> I don't have a sense of how conservative Parquet users generally are.
> Is it worth adding a LZ4_FRAMED compression option in the Parquet
> format, or would people just not use it?
>
> Regards
>
> Antoine.
>
>
> On Tue, 30 Jun 2020 14:33:17 +0200
> "Uwe L. Korn" <uw...@xhochy.com> wrote:
> > I'm also in favor of disabling support for now. Having to deal with
> broken files or the detection of various incompatible implementations in
> the long-term will harm more than not supporting LZ4 for a while. Snappy is
> generally more used than LZ4 in this category as it has been available
> since the inception of Parquet and thus should be considered as a viable
> alternative.
> >
> > Cheers
> > Uwe
> >
> > On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> > > On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <an...@python.org> wrote:
> > > >
> > > >
> > > > On 25/06/2020 at 00:02, Wes McKinney wrote:
> > > > > hi folks,
> > > > >
> > > > > (cross-posting to dev@arrow and dev@parquet since there are
> > > > > stakeholders in both places)
> > > > >
> > > > > It seems there are still problems at least with the C++
> implementation
> > > > > of LZ4 compression in Parquet files
> > > > >
> > > > > https://issues.apache.org/jira/browse/PARQUET-1241
> > > > > https://issues.apache.org/jira/browse/PARQUET-1878
> > > >
> > > > I don't have any particular opinion on how to solve the LZ4 issue,
> but
> > > > I'd like to mention that LZ4 and ZStandard are the two most efficient
> > > > compression algorithms available, and they span different parts of
> the
> > > > speed/compression spectrum, so it would be a pity to disable one of
> them.
> > >
> > > It's true, however I think it's worse to write LZ4-compressed files
> > > that cannot be read by other Parquet implementations (if that's what's
> > > happening as I understand it?). If we are indeed shipping something
> > > broken then we either should fix it or disable it until it can be
> > > fixed.
> > >
> > > > Regards
> > > >
> > > > Antoine.
> > >
> >
>
>
>
>

-- 


│ Christian Hudon

│ Applied Research Scientist

   Element AI, 6650 Saint-Urbain #500

   Montréal, QC, H2S 3G9, Canada
   Elementai.com