Posted to dev@parquet.apache.org by Jorge Cardoso Leitão <jo...@gmail.com> on 2021/07/20 04:28:30 UTC

Support for DELTA_LENGTH_BYTE_ARRAY?

Hi,

I am trying to add support for DELTA_LENGTH_BYTE_ARRAY in a package, but I
am struggling to find readers that support it, despite the fact that the spec
states "This encoding is always preferred over PLAIN for byte array columns."

* spark 3.X: Unsupported encoding: DELTA_LENGTH_BYTE_ARRAY
* pyarrow 4: OSError: Not yet implemented: Unsupported encoding.

Is there a minimal set of preferred encodings, or do people just ignore the
other encodings and use either PLAIN or dictionary? Or are those encodings
simply not widely supported because they don't bring sufficient benefits?
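FWIW, the way I have been checking which encodings a writer actually emits is
via the file metadata in pyarrow (a small sketch; "example.parquet" is just a
placeholder path):

    import pyarrow.parquet as pq

    # Print the encodings recorded for every column chunk in the file.
    meta = pq.ParquetFile("example.parquet").metadata
    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            chunk = meta.row_group(rg).column(col)
            print(rg, chunk.path_in_schema, chunk.encodings)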

Could someone offer some context to the situation?

Best,
Jorge

Re: Support for DELTA_LENGTH_BYTE_ARRAY?

Posted by Micah Kornfield <em...@gmail.com>.
Hi Jorge,
This has been discussed previously; there is an open PR [1] that tries to
settle on a minimally supported feature set, but it has lost some steam.

-Micah

[1] https://github.com/apache/parquet-format/pull/164


Re: Support for DELTA_LENGTH_BYTE_ARRAY?

Posted by Kyle Bendickson <kb...@apple.com.INVALID>.
Oh my apologies -

There's definitely substantial benefit to the vectorized encodings, both in terms of performance (though I don't have numbers off hand) and in terms of interoperability.

I admittedly have not directly tried it with PyArrow 4.

Work on this would be great though. It would be ideal if the entirety of the parquet writer v2 file format could be read, including with vectorization enabled.

Here is the Spark code in the latest master branch: https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet

Trino supports it, so I would also check there. I found this open issue for the "optimized" Parquet writer; the next ticket (per that doc) is related to Spark interop, and the one after that is related to the V2 file format and "other engines": https://github.com/trinodb/trino/issues/6382

There has been some work recently to support more vectorization options and update the error messages.

If there's not an open issue on the Iceberg GitHub, can you open one? We could potentially link to the open Trino issue as well, since this affects Iceberg users on both platforms :)

Let me know if there is any further way I can be of help :)
Kyle


Kyle Bendickson
Software Engineer
Apple
ACS Data
One Apple Park Way,
Cupertino, CA 95014, USA
kbendickson@apple.com



Re: Support for DELTA_LENGTH_BYTE_ARRAY?

Posted by Kyle Bendickson <kb...@apple.com.INVALID>.
Hi Jorge,

DELTA_LENGTH_BYTE_ARRAY encoding is a parquet writer v2 feature.

I do not believe that the Spark ParquetReaders implement page v2 at the moment (this might be what you mean when you say you’re working on it).

For files generated with DELTA_BYTE_ARRAY encoding (e.g. from Trino), I've been able to get around it by disabling vectorized Parquet reads. This has solved the immediate problem of making the files readable for me, though of course it comes with the disadvantage of not getting vectorized reads.

Try changing this setting to false: spark.sql.iceberg.vectorization.enabled
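In PySpark that would look roughly like this (a sketch only; the table name is a placeholder, it assumes the Iceberg runtime and catalog are configured, and the exact property can vary across Iceberg/Spark versions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-v2-pages").getOrCreate()

    # Fall back to the non-vectorized Parquet read path so that v2-encoded
    # pages can still be read (at the cost of vectorized performance).
    spark.conf.set("spark.sql.iceberg.vectorization.enabled", "false")

    # Placeholder Iceberg table name.
    df = spark.table("db.some_iceberg_table")
    df.show(5)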

Here’s an issue that should get you some more insight: https://github.com/apache/iceberg/issues/2692

Let me know if that answers your question!


Kyle Bendickson
Software Engineer
Apple
ACS Data
One Apple Park Way,
Cupertino, CA 95014, USA
kbendickson@apple.com
