Posted to dev@parquet.apache.org by Brian Bowman <Br...@sas.com> on 2019/04/05 17:22:54 UTC

Need 64-bit Integer length for Parquet ByteArray Type

All,

SAS requires support for storing variable-length character and binary blobs with a 2^64 max length in Parquet.  Currently, the ByteArray len field is a uint32_t.  It looks like this will require incrementing the Parquet file format version and changing ByteArray len to uint64_t.
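
For illustration only, the C++-side shape of such a change might look like the sketch below (the struct name is hypothetical, and the real cost is in the format spec and in every implementation, not in this struct):

#include <cstdint>

// Sketch of a 64-bit-length variant of parquet-cpp's ByteArray (illustrative only).
struct LargeByteArray {
  LargeByteArray() : len(0), ptr(nullptr) {}
  LargeByteArray(uint64_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint64_t len;        // uint32_t in today's ByteArray
  const uint8_t* ptr;  // not owned
};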

Have there been any requests for this or other Parquet developments that require file format versioning changes?

I realize this is a non-trivial ask.  Thanks for considering it.

-Brian

Re: Need 64-bit Integer length for Parquet ByteArray Type

Posted by Brian Bowman <Br...@sas.com>.
Hello Wes,

Thanks for the info!  I'm working to better understand Parquet/Arrow design and development processes.   No hurry for LARGE_BYTE_ARRAY.

-Brian


On 4/26/19, 11:14 AM, "Wes McKinney" <we...@gmail.com> wrote:

    EXTERNAL
    
    hi Brian,
    
    I doubt that such a change could be made on a short time horizon.
    Collecting feedback and building consensus (if it is even possible)
    with stakeholders would take some time. The appropriate place to have
    the discussion is here on the mailing list, though
    
    Thanks
    
    On Mon, Apr 8, 2019 at 1:37 PM Brian Bowman <Br...@sas.com> wrote:
    >
    > Hello Wes/all,
    >
    > A new LARGE_BYTE_ARRAY type in Parquet would satisfy SAS' needs without resorting to other alternatives.  Is this something that could be done in Parquet over the next few months?  I have a lot of experience with file formats/storage layer internals and can contribute for Parquet C++.
    >
    > -Brian
    >
    > On 4/5/19, 3:44 PM, "Wes McKinney" <we...@gmail.com> wrote:
    >
    >     EXTERNAL
    >
    >     hi Brian,
    >
    >     Just to comment from the C++ side -- the 64-bit issue is a limitation
    >     of the Parquet format itself and not related to the C++
    >     implementation. It would be possibly interesting to add a
    >     LARGE_BYTE_ARRAY type with 64-bit offset encoding (we are discussing
    >     doing much the same in Apache Arrow for in-memory)
    >
    >     - Wes
    >
    >     On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
    >     >
    >     > I don't think that's what you would want to do. Parquet will eventually
    >     > compress large values, but not after making defensive copies and attempting
    >     > to encode them. In the end, it will be a lot more overhead, plus the work
    >     > to make it possible. I think you'd be much better of compressing before
    >     > storing in Parquet if you expect good compression rates.
    >     >
    >     > On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman <Br...@sas.com> wrote:
    >     >
    >     > > My hope is that these large ByteArray values will encode/compress to a
    >     > > fraction of their original size.  FWIW, cpp/src/parquet/
    >     > > column_writer.cc/.h has int64_t offset and length fields all over the
    >     > > place.
    >     > >
    >     > > External file references to BLOBS is doable but not the elegant,
    >     > > integrated solution I was hoping for.
    >     > >
    >     > > -Brian
    >     > >
    >     > > On Apr 5, 2019, at 1:53 PM, Ryan Blue <rb...@netflix.com> wrote:
    >     > >
    >     > > *EXTERNAL*
    >     > > Looks like we will need a new encoding for this:
    >     > > https://github.com/apache/parquet-format/blob/master/Encodings.md
    >     > >
    >     > > That doc specifies that the plain encoding uses a 4-byte length. That's
    >     > > not going to be a quick fix.
    >     > >
    >     > > Now that I'm thinking about this a bit more, does it make sense to support
    >     > > byte arrays that are more than 2GB? That's far larger than the size of a
    >     > > row group, let alone a page. This would completely break memory management
    >     > > in the JVM implementation.
    >     > >
    >     > > Can you solve this problem using a BLOB type that references an external
    >     > > file with the gigantic values? Seems to me that values this large should go
    >     > > in separate files, not in a Parquet file where it would destroy any benefit
    >     > > from using the format.
    >     > >
    >     > > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <Br...@sas.com> wrote:
    >     > >
    >     > >> Hello Ryan,
    >     > >>
    >     > >> Looks like it's limited by both the Parquet implementation and the Thrift
    >     > >> message methods.  Am I missing anything?
    >     > >>
    >     > >> From cpp/src/parquet/types.h
    >     > >>
    >     > >> struct ByteArray {
    >     > >>   ByteArray() : len(0), ptr(NULLPTR) {}
    >     > >>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
    >     > >>   uint32_t len;
    >     > >>   const uint8_t* ptr;
    >     > >> };
    >     > >>
    >     > >> From cpp/src/parquet/thrift.h
    >     > >>
    >     > >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T*
    >     > >> deserialized_msg) {
    >     > >> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream*
    >     > >> out)
    >     > >>
    >     > >> -Brian
    >     > >>
    >     > >> On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:
    >     > >>
    >     > >>     EXTERNAL
    >     > >>
    >     > >>     Hi Brian,
    >     > >>
    >     > >>     This seems like something we should allow. What imposes the current
    >     > >> limit?
    >     > >>     Is it in the thrift format, or just the implementations?
    >     > >>
    >     > >>     On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <Br...@sas.com>
    >     > >> wrote:
    >     > >>
    >     > >>     > All,
    >     > >>     >
    >     > >>     > SAS requires support for storing varying-length character and
    >     > >> binary blobs
    >     > >>     > with a 2^64 max length in Parquet.   Currently, the ByteArray len
    >     > >> field is
    >     > >>     > a unint32_t.   Looks this the will require incrementing the Parquet
    >     > >> file
    >     > >>     > format version and changing ByteArray len to uint64_t.
    >     > >>     >
    >     > >>     > Have there been any requests for this or other Parquet developments
    >     > >> that
    >     > >>     > require file format versioning changes?
    >     > >>     >
    >     > >>     > I realize this a non-trivial ask.  Thanks for considering it.
    >     > >>     >
    >     > >>     > -Brian
    >     > >>     >
    >     > >>
    >     > >>
    >     > >>     --
    >     > >>     Ryan Blue
    >     > >>     Software Engineer
    >     > >>     Netflix
    >     > >>
    >     > >>
    >     > >>
    >     > >
    >     > > --
    >     > > Ryan Blue
    >     > > Software Engineer
    >     > > Netflix
    >     > >
    >     > >
    >     >
    >     > --
    >     > Ryan Blue
    >     > Software Engineer
    >     > Netflix
    >
    >
    


Re: Need 64-bit Integer length for Parquet ByteArray Type

Posted by Wes McKinney <we...@gmail.com>.
hi Brian,

I doubt that such a change could be made on a short time horizon.
Collecting feedback and building consensus (if it is even possible)
with stakeholders would take some time. The appropriate place to have
the discussion is here on the mailing list, though.

Thanks

On Mon, Apr 8, 2019 at 1:37 PM Brian Bowman <Br...@sas.com> wrote:
>
> Hello Wes/all,
>
> A new LARGE_BYTE_ARRAY type in Parquet would satisfy SAS' needs without resorting to other alternatives.  Is this something that could be done in Parquet over the next few months?  I have a lot of experience with file formats/storage layer internals and can contribute for Parquet C++.
>
> -Brian
>
> On 4/5/19, 3:44 PM, "Wes McKinney" <we...@gmail.com> wrote:
>
>     EXTERNAL
>
>     hi Brian,
>
>     Just to comment from the C++ side -- the 64-bit issue is a limitation
>     of the Parquet format itself and not related to the C++
>     implementation. It would be possibly interesting to add a
>     LARGE_BYTE_ARRAY type with 64-bit offset encoding (we are discussing
>     doing much the same in Apache Arrow for in-memory)
>
>     - Wes
>
>     On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>     >
>     > I don't think that's what you would want to do. Parquet will eventually
>     > compress large values, but not after making defensive copies and attempting
>     > to encode them. In the end, it will be a lot more overhead, plus the work
>     > to make it possible. I think you'd be much better of compressing before
>     > storing in Parquet if you expect good compression rates.
>     >
>     > On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman <Br...@sas.com> wrote:
>     >
>     > > My hope is that these large ByteArray values will encode/compress to a
>     > > fraction of their original size.  FWIW, cpp/src/parquet/
>     > > column_writer.cc/.h has int64_t offset and length fields all over the
>     > > place.
>     > >
>     > > External file references to BLOBS is doable but not the elegant,
>     > > integrated solution I was hoping for.
>     > >
>     > > -Brian
>     > >
>     > > On Apr 5, 2019, at 1:53 PM, Ryan Blue <rb...@netflix.com> wrote:
>     > >
>     > > *EXTERNAL*
>     > > Looks like we will need a new encoding for this:
>     > > https://github.com/apache/parquet-format/blob/master/Encodings.md
>     > >
>     > > That doc specifies that the plain encoding uses a 4-byte length. That's
>     > > not going to be a quick fix.
>     > >
>     > > Now that I'm thinking about this a bit more, does it make sense to support
>     > > byte arrays that are more than 2GB? That's far larger than the size of a
>     > > row group, let alone a page. This would completely break memory management
>     > > in the JVM implementation.
>     > >
>     > > Can you solve this problem using a BLOB type that references an external
>     > > file with the gigantic values? Seems to me that values this large should go
>     > > in separate files, not in a Parquet file where it would destroy any benefit
>     > > from using the format.
>     > >
>     > > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <Br...@sas.com> wrote:
>     > >
>     > >> Hello Ryan,
>     > >>
>     > >> Looks like it's limited by both the Parquet implementation and the Thrift
>     > >> message methods.  Am I missing anything?
>     > >>
>     > >> From cpp/src/parquet/types.h
>     > >>
>     > >> struct ByteArray {
>     > >>   ByteArray() : len(0), ptr(NULLPTR) {}
>     > >>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
>     > >>   uint32_t len;
>     > >>   const uint8_t* ptr;
>     > >> };
>     > >>
>     > >> From cpp/src/parquet/thrift.h
>     > >>
>     > >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T*
>     > >> deserialized_msg) {
>     > >> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream*
>     > >> out)
>     > >>
>     > >> -Brian
>     > >>
>     > >> On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:
>     > >>
>     > >>     EXTERNAL
>     > >>
>     > >>     Hi Brian,
>     > >>
>     > >>     This seems like something we should allow. What imposes the current
>     > >> limit?
>     > >>     Is it in the thrift format, or just the implementations?
>     > >>
>     > >>     On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <Br...@sas.com>
>     > >> wrote:
>     > >>
>     > >>     > All,
>     > >>     >
>     > >>     > SAS requires support for storing varying-length character and
>     > >> binary blobs
>     > >>     > with a 2^64 max length in Parquet.   Currently, the ByteArray len
>     > >> field is
>     > >>     > a unint32_t.   Looks this the will require incrementing the Parquet
>     > >> file
>     > >>     > format version and changing ByteArray len to uint64_t.
>     > >>     >
>     > >>     > Have there been any requests for this or other Parquet developments
>     > >> that
>     > >>     > require file format versioning changes?
>     > >>     >
>     > >>     > I realize this a non-trivial ask.  Thanks for considering it.
>     > >>     >
>     > >>     > -Brian
>     > >>     >
>     > >>
>     > >>
>     > >>     --
>     > >>     Ryan Blue
>     > >>     Software Engineer
>     > >>     Netflix
>     > >>
>     > >>
>     > >>
>     > >
>     > > --
>     > > Ryan Blue
>     > > Software Engineer
>     > > Netflix
>     > >
>     > >
>     >
>     > --
>     > Ryan Blue
>     > Software Engineer
>     > Netflix
>
>

Re: Need 64-bit Integer length for Parquet ByteArray Type

Posted by Brian Bowman <Br...@sas.com>.
Hello Wes/all,

A new LARGE_BYTE_ARRAY type in Parquet would satisfy SAS' needs without resorting to other alternatives.  Is this something that could be done in Parquet over the next few months?  I have a lot of experience with file formats/storage layer internals and can contribute to Parquet C++.

-Brian

On 4/5/19, 3:44 PM, "Wes McKinney" <we...@gmail.com> wrote:

    EXTERNAL
    
    hi Brian,
    
    Just to comment from the C++ side -- the 64-bit issue is a limitation
    of the Parquet format itself and not related to the C++
    implementation. It would be possibly interesting to add a
    LARGE_BYTE_ARRAY type with 64-bit offset encoding (we are discussing
    doing much the same in Apache Arrow for in-memory)
    
    - Wes
    
    On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
    >
    > I don't think that's what you would want to do. Parquet will eventually
    > compress large values, but not after making defensive copies and attempting
    > to encode them. In the end, it will be a lot more overhead, plus the work
    > to make it possible. I think you'd be much better of compressing before
    > storing in Parquet if you expect good compression rates.
    >
    > On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman <Br...@sas.com> wrote:
    >
    > > My hope is that these large ByteArray values will encode/compress to a
    > > fraction of their original size.  FWIW, cpp/src/parquet/
    > > column_writer.cc/.h has int64_t offset and length fields all over the
    > > place.
    > >
    > > External file references to BLOBS is doable but not the elegant,
    > > integrated solution I was hoping for.
    > >
    > > -Brian
    > >
    > > On Apr 5, 2019, at 1:53 PM, Ryan Blue <rb...@netflix.com> wrote:
    > >
    > > *EXTERNAL*
    > > Looks like we will need a new encoding for this:
    > > https://github.com/apache/parquet-format/blob/master/Encodings.md
    > >
    > > That doc specifies that the plain encoding uses a 4-byte length. That's
    > > not going to be a quick fix.
    > >
    > > Now that I'm thinking about this a bit more, does it make sense to support
    > > byte arrays that are more than 2GB? That's far larger than the size of a
    > > row group, let alone a page. This would completely break memory management
    > > in the JVM implementation.
    > >
    > > Can you solve this problem using a BLOB type that references an external
    > > file with the gigantic values? Seems to me that values this large should go
    > > in separate files, not in a Parquet file where it would destroy any benefit
    > > from using the format.
    > >
    > > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <Br...@sas.com> wrote:
    > >
    > >> Hello Ryan,
    > >>
    > >> Looks like it's limited by both the Parquet implementation and the Thrift
    > >> message methods.  Am I missing anything?
    > >>
    > >> From cpp/src/parquet/types.h
    > >>
    > >> struct ByteArray {
    > >>   ByteArray() : len(0), ptr(NULLPTR) {}
    > >>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
    > >>   uint32_t len;
    > >>   const uint8_t* ptr;
    > >> };
    > >>
    > >> From cpp/src/parquet/thrift.h
    > >>
    > >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T*
    > >> deserialized_msg) {
    > >> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream*
    > >> out)
    > >>
    > >> -Brian
    > >>
    > >> On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:
    > >>
    > >>     EXTERNAL
    > >>
    > >>     Hi Brian,
    > >>
    > >>     This seems like something we should allow. What imposes the current
    > >> limit?
    > >>     Is it in the thrift format, or just the implementations?
    > >>
    > >>     On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <Br...@sas.com>
    > >> wrote:
    > >>
    > >>     > All,
    > >>     >
    > >>     > SAS requires support for storing varying-length character and
    > >> binary blobs
    > >>     > with a 2^64 max length in Parquet.   Currently, the ByteArray len
    > >> field is
    > >>     > a unint32_t.   Looks this the will require incrementing the Parquet
    > >> file
    > >>     > format version and changing ByteArray len to uint64_t.
    > >>     >
    > >>     > Have there been any requests for this or other Parquet developments
    > >> that
    > >>     > require file format versioning changes?
    > >>     >
    > >>     > I realize this a non-trivial ask.  Thanks for considering it.
    > >>     >
    > >>     > -Brian
    > >>     >
    > >>
    > >>
    > >>     --
    > >>     Ryan Blue
    > >>     Software Engineer
    > >>     Netflix
    > >>
    > >>
    > >>
    > >
    > > --
    > > Ryan Blue
    > > Software Engineer
    > > Netflix
    > >
    > >
    >
    > --
    > Ryan Blue
    > Software Engineer
    > Netflix
    


Re: Need 64-bit Integer length for Parquet ByteArray Type

Posted by Wes McKinney <we...@gmail.com>.
hi Brian,

Just to comment from the C++ side -- the 64-bit issue is a limitation
of the Parquet format itself, not of the C++ implementation. It could
be interesting to add a LARGE_BYTE_ARRAY type with 64-bit offset
encoding (we are discussing doing much the same in Apache Arrow for
in-memory data).
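
As a rough sketch of what 64-bit offsets would mean, borrowing the Arrow-style
variable-length layout (illustrative only, not a spec):

#include <cstdint>
#include <vector>

// Sketch: variable-length binary layout with 64-bit offsets (illustrative only).
// Value i occupies data[offsets[i] .. offsets[i+1]), so a single value -- and the
// whole buffer -- can exceed the 2^31-1 limit implied by 32-bit lengths.
struct LargeBinaryColumn {
  std::vector<int64_t> offsets;  // offsets.size() == num_values + 1
  std::vector<uint8_t> data;     // concatenated value bytes
};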

- Wes

On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> I don't think that's what you would want to do. Parquet will eventually
> compress large values, but not after making defensive copies and attempting
> to encode them. In the end, it will be a lot more overhead, plus the work
> to make it possible. I think you'd be much better of compressing before
> storing in Parquet if you expect good compression rates.
>
> On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman <Br...@sas.com> wrote:
>
> > My hope is that these large ByteArray values will encode/compress to a
> > fraction of their original size.  FWIW, cpp/src/parquet/
> > column_writer.cc/.h has int64_t offset and length fields all over the
> > place.
> >
> > External file references to BLOBS is doable but not the elegant,
> > integrated solution I was hoping for.
> >
> > -Brian
> >
> > On Apr 5, 2019, at 1:53 PM, Ryan Blue <rb...@netflix.com> wrote:
> >
> > *EXTERNAL*
> > Looks like we will need a new encoding for this:
> > https://github.com/apache/parquet-format/blob/master/Encodings.md
> >
> > That doc specifies that the plain encoding uses a 4-byte length. That's
> > not going to be a quick fix.
> >
> > Now that I'm thinking about this a bit more, does it make sense to support
> > byte arrays that are more than 2GB? That's far larger than the size of a
> > row group, let alone a page. This would completely break memory management
> > in the JVM implementation.
> >
> > Can you solve this problem using a BLOB type that references an external
> > file with the gigantic values? Seems to me that values this large should go
> > in separate files, not in a Parquet file where it would destroy any benefit
> > from using the format.
> >
> > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <Br...@sas.com> wrote:
> >
> >> Hello Ryan,
> >>
> >> Looks like it's limited by both the Parquet implementation and the Thrift
> >> message methods.  Am I missing anything?
> >>
> >> From cpp/src/parquet/types.h
> >>
> >> struct ByteArray {
> >>   ByteArray() : len(0), ptr(NULLPTR) {}
> >>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
> >>   uint32_t len;
> >>   const uint8_t* ptr;
> >> };
> >>
> >> From cpp/src/parquet/thrift.h
> >>
> >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T*
> >> deserialized_msg) {
> >> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream*
> >> out)
> >>
> >> -Brian
> >>
> >> On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:
> >>
> >>     EXTERNAL
> >>
> >>     Hi Brian,
> >>
> >>     This seems like something we should allow. What imposes the current
> >> limit?
> >>     Is it in the thrift format, or just the implementations?
> >>
> >>     On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <Br...@sas.com>
> >> wrote:
> >>
> >>     > All,
> >>     >
> >>     > SAS requires support for storing varying-length character and
> >> binary blobs
> >>     > with a 2^64 max length in Parquet.   Currently, the ByteArray len
> >> field is
> >>     > a unint32_t.   Looks this the will require incrementing the Parquet
> >> file
> >>     > format version and changing ByteArray len to uint64_t.
> >>     >
> >>     > Have there been any requests for this or other Parquet developments
> >> that
> >>     > require file format versioning changes?
> >>     >
> >>     > I realize this a non-trivial ask.  Thanks for considering it.
> >>     >
> >>     > -Brian
> >>     >
> >>
> >>
> >>     --
> >>     Ryan Blue
> >>     Software Engineer
> >>     Netflix
> >>
> >>
> >>
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
>
> --
> Ryan Blue
> Software Engineer
> Netflix

Re: Need 64-bit Integer length for Parquet ByteArray Type

Posted by Brian Bowman <Br...@sas.com>.
Thanks Ryan,

After further pondering this, I came to similar conclusions.

Compress the data before putting it into a Parquet ByteArray, and if that’s not feasible, reference it in an external/persisted data structure.

Another alternative is to create one or more “shadow columns” to store the overflow horizontally.
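
As a rough sketch of the shadow-column idea (the chunk size, helper name, and column names are made up, purely illustrative):

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Sketch only: split one oversized value into pieces no larger than kChunk bytes,
// to be written to hypothetical blob_part_0, blob_part_1, ... BINARY columns of
// the same row and re-concatenated on read.
std::vector<std::string> SplitForShadowColumns(const uint8_t* data, uint64_t len,
                                               uint64_t kChunk = (1ULL << 31) - 1) {
  std::vector<std::string> parts;
  for (uint64_t pos = 0; pos < len; pos += kChunk) {
    uint64_t n = std::min(kChunk, len - pos);
    parts.emplace_back(reinterpret_cast<const char*>(data) + pos, n);
  }
  return parts;
}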

-Brian

On Apr 5, 2019, at 3:11 PM, Ryan Blue <rb...@netflix.com>> wrote:


EXTERNAL

I don't think that's what you would want to do. Parquet will eventually compress large values, but not after making defensive copies and attempting to encode them. In the end, it will be a lot more overhead, plus the work to make it possible. I think you'd be much better of compressing before storing in Parquet if you expect good compression rates.

On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman <Br...@sas.com>> wrote:
My hope is that these large ByteArray values will encode/compress to a fraction of their original size.  FWIW, cpp/src/parquet/column_writer.cc/.h has int64_t offset and length fields all over the place.

External file references to BLOBS is doable but not the elegant, integrated solution I was hoping for.

-Brian

On Apr 5, 2019, at 1:53 PM, Ryan Blue <rb...@netflix.com>> wrote:


EXTERNAL

Looks like we will need a new encoding for this: https://github.com/apache/parquet-format/blob/master/Encodings.md

That doc specifies that the plain encoding uses a 4-byte length. That's not going to be a quick fix.

Now that I'm thinking about this a bit more, does it make sense to support byte arrays that are more than 2GB? That's far larger than the size of a row group, let alone a page. This would completely break memory management in the JVM implementation.

Can you solve this problem using a BLOB type that references an external file with the gigantic values? Seems to me that values this large should go in separate files, not in a Parquet file where it would destroy any benefit from using the format.

On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <Br...@sas.com>> wrote:
Hello Ryan,

Looks like it's limited by both the Parquet implementation and the Thrift message methods.  Am I missing anything?

From cpp/src/parquet/types.h

struct ByteArray {
  ByteArray() : len(0), ptr(NULLPTR) {}
  ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint32_t len;
  const uint8_t* ptr;
};

From cpp/src/parquet/thrift.h

inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* deserialized_msg) {
inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)

-Brian

On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID>> wrote:

    EXTERNAL

    Hi Brian,

    This seems like something we should allow. What imposes the current limit?
    Is it in the thrift format, or just the implementations?

    On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <Br...@sas.com>> wrote:

    > All,
    >
    > SAS requires support for storing varying-length character and binary blobs
    > with a 2^64 max length in Parquet.   Currently, the ByteArray len field is
    > a unint32_t.   Looks this the will require incrementing the Parquet file
    > format version and changing ByteArray len to uint64_t.
    >
    > Have there been any requests for this or other Parquet developments that
    > require file format versioning changes?
    >
    > I realize this a non-trivial ask.  Thanks for considering it.
    >
    > -Brian
    >


    --
    Ryan Blue
    Software Engineer
    Netflix




--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix

Re: Need 64-bit Integer length for Parquet ByteArray Type

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I don't think that's what you would want to do. Parquet will eventually
compress large values, but only after making defensive copies and attempting
to encode them. In the end, it will be a lot more overhead, plus the work
to make it possible. I think you'd be much better off compressing before
storing in Parquet if you expect good compression rates.
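
Purely as a sketch of the "compress before storing" route, for example with zstd
(Parquet itself is not involved here and the helper name is made up):

#include <zstd.h>

#include <stdexcept>
#include <string>

// Sketch: compress a large value up front, then hand the result to Parquet as an
// ordinary BYTE_ARRAY value. Assumes libzstd is available; minimal error handling.
std::string CompressBeforeStoring(const void* src, size_t src_size, int level = 3) {
  size_t bound = ZSTD_compressBound(src_size);
  std::string out(bound, '\0');
  size_t n = ZSTD_compress(&out[0], bound, src, src_size, level);
  if (ZSTD_isError(n)) throw std::runtime_error(ZSTD_getErrorName(n));
  out.resize(n);
  return out;
}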

On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman <Br...@sas.com> wrote:

> My hope is that these large ByteArray values will encode/compress to a
> fraction of their original size.  FWIW, cpp/src/parquet/
> column_writer.cc/.h has int64_t offset and length fields all over the
> place.
>
> External file references to BLOBS is doable but not the elegant,
> integrated solution I was hoping for.
>
> -Brian
>
> On Apr 5, 2019, at 1:53 PM, Ryan Blue <rb...@netflix.com> wrote:
>
> *EXTERNAL*
> Looks like we will need a new encoding for this:
> https://github.com/apache/parquet-format/blob/master/Encodings.md
>
> That doc specifies that the plain encoding uses a 4-byte length. That's
> not going to be a quick fix.
>
> Now that I'm thinking about this a bit more, does it make sense to support
> byte arrays that are more than 2GB? That's far larger than the size of a
> row group, let alone a page. This would completely break memory management
> in the JVM implementation.
>
> Can you solve this problem using a BLOB type that references an external
> file with the gigantic values? Seems to me that values this large should go
> in separate files, not in a Parquet file where it would destroy any benefit
> from using the format.
>
> On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <Br...@sas.com> wrote:
>
>> Hello Ryan,
>>
>> Looks like it's limited by both the Parquet implementation and the Thrift
>> message methods.  Am I missing anything?
>>
>> From cpp/src/parquet/types.h
>>
>> struct ByteArray {
>>   ByteArray() : len(0), ptr(NULLPTR) {}
>>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
>>   uint32_t len;
>>   const uint8_t* ptr;
>> };
>>
>> From cpp/src/parquet/thrift.h
>>
>> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T*
>> deserialized_msg) {
>> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream*
>> out)
>>
>> -Brian
>>
>> On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:
>>
>>     EXTERNAL
>>
>>     Hi Brian,
>>
>>     This seems like something we should allow. What imposes the current
>> limit?
>>     Is it in the thrift format, or just the implementations?
>>
>>     On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <Br...@sas.com>
>> wrote:
>>
>>     > All,
>>     >
>>     > SAS requires support for storing varying-length character and
>> binary blobs
>>     > with a 2^64 max length in Parquet.   Currently, the ByteArray len
>> field is
>>     > a unint32_t.   Looks this the will require incrementing the Parquet
>> file
>>     > format version and changing ByteArray len to uint64_t.
>>     >
>>     > Have there been any requests for this or other Parquet developments
>> that
>>     > require file format versioning changes?
>>     >
>>     > I realize this a non-trivial ask.  Thanks for considering it.
>>     >
>>     > -Brian
>>     >
>>
>>
>>     --
>>     Ryan Blue
>>     Software Engineer
>>     Netflix
>>
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: Need 64-bit Integer length for Parquet ByteArray Type

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I don't think that's what you would want to do. Parquet will eventually
compress large values, but not after making defensive copies and attempting
to encode them. In the end, it will be a lot more overhead, plus the work
to make it possible. I think you'd be much better of compressing before
storing in Parquet if you expect good compression rates.

On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman <Br...@sas.com> wrote:

> My hope is that these large ByteArray values will encode/compress to a
> fraction of their original size.  FWIW, cpp/src/parquet/
> column_writer.cc/.h has int64_t offset and length fields all over the
> place.
>
> External file references to BLOBS is doable but not the elegant,
> integrated solution I was hoping for.
>
> -Brian
>
> On Apr 5, 2019, at 1:53 PM, Ryan Blue <rb...@netflix.com> wrote:
>
> *EXTERNAL*
> Looks like we will need a new encoding for this:
> https://github.com/apache/parquet-format/blob/master/Encodings.md
>
> That doc specifies that the plain encoding uses a 4-byte length. That's
> not going to be a quick fix.
>
> Now that I'm thinking about this a bit more, does it make sense to support
> byte arrays that are more than 2GB? That's far larger than the size of a
> row group, let alone a page. This would completely break memory management
> in the JVM implementation.
>
> Can you solve this problem using a BLOB type that references an external
> file with the gigantic values? Seems to me that values this large should go
> in separate files, not in a Parquet file where it would destroy any benefit
> from using the format.
>
> On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <Br...@sas.com> wrote:
>
>> Hello Ryan,
>>
>> Looks like it's limited by both the Parquet implementation and the Thrift
>> message methods.  Am I missing anything?
>>
>> From cpp/src/parquet/types.h
>>
>> struct ByteArray {
>>   ByteArray() : len(0), ptr(NULLPTR) {}
>>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
>>   uint32_t len;
>>   const uint8_t* ptr;
>> };
>>
>> From cpp/src/parquet/thrift.h
>>
>> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T*
>> deserialized_msg) {
>> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream*
>> out)
>>
>> -Brian
>>
>> On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:
>>
>>     EXTERNAL
>>
>>     Hi Brian,
>>
>>     This seems like something we should allow. What imposes the current
>> limit?
>>     Is it in the thrift format, or just the implementations?
>>
>>     On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <Br...@sas.com>
>> wrote:
>>
>>     > All,
>>     >
>>     > SAS requires support for storing varying-length character and
>> binary blobs
>>     > with a 2^64 max length in Parquet.   Currently, the ByteArray len
>> field is
>>     > a unint32_t.   Looks this the will require incrementing the Parquet
>> file
>>     > format version and changing ByteArray len to uint64_t.
>>     >
>>     > Have there been any requests for this or other Parquet developments
>> that
>>     > require file format versioning changes?
>>     >
>>     > I realize this is a non-trivial ask.  Thanks for considering it.
>>     >
>>     > -Brian
>>     >
>>
>>
>>     --
>>     Ryan Blue
>>     Software Engineer
>>     Netflix
>>
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: Need 64-bit Integer length for Parquet ByteArray Type

Posted by Brian Bowman <Br...@sas.com>.
My hope is that these large ByteArray values will encode/compress to a fraction of their original size.  FWIW, cpp/src/parquet/column_writer.cc/.h has int64_t offset and length fields all over the place.

External file references to BLOBs are doable but not the elegant, integrated solution I was hoping for.

-Brian

On Apr 5, 2019, at 1:53 PM, Ryan Blue <rb...@netflix.com> wrote:


EXTERNAL

Looks like we will need a new encoding for this: https://github.com/apache/parquet-format/blob/master/Encodings.md

That doc specifies that the plain encoding uses a 4-byte length. That's not going to be a quick fix.

Now that I'm thinking about this a bit more, does it make sense to support byte arrays that are more than 2GB? That's far larger than the size of a row group, let alone a page. This would completely break memory management in the JVM implementation.

Can you solve this problem using a BLOB type that references an external file with the gigantic values? Seems to me that values this large should go in separate files, not in a Parquet file where it would destroy any benefit from using the format.

On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <Br...@sas.com> wrote:
Hello Ryan,

Looks like it's limited by both the Parquet implementation and the Thrift message methods.  Am I missing anything?

From cpp/src/parquet/types.h

struct ByteArray {
  ByteArray() : len(0), ptr(NULLPTR) {}
  ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint32_t len;
  const uint8_t* ptr;
};

From cpp/src/parquet/thrift.h

inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* deserialized_msg) {
inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)

-Brian

On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:

    EXTERNAL

    Hi Brian,

    This seems like something we should allow. What imposes the current limit?
    Is it in the thrift format, or just the implementations?

    On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <Br...@sas.com> wrote:

    > All,
    >
    > SAS requires support for storing varying-length character and binary blobs
    > with a 2^64 max length in Parquet.   Currently, the ByteArray len field is
    > a uint32_t.   Looks like this will require incrementing the Parquet file
    > format version and changing ByteArray len to uint64_t.
    >
    > Have there been any requests for this or other Parquet developments that
    > require file format versioning changes?
    >
    > I realize this is a non-trivial ask.  Thanks for considering it.
    >
    > -Brian
    >


    --
    Ryan Blue
    Software Engineer
    Netflix




--
Ryan Blue
Software Engineer
Netflix

Re: Need 64-bit Integer length for Parquet ByteArray Type

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Looks like we will need a new encoding for this:
https://github.com/apache/parquet-format/blob/master/Encodings.md

That doc specifies that the plain encoding uses a 4-byte length. That's not
going to be a quick fix.
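
(For context, the PLAIN layout for a BYTE_ARRAY value is a 4-byte little-endian length prefix followed by the raw bytes, which is exactly where the 2^32 - 1 ceiling comes from. A rough sketch of a writer emitting one value -- the function name is illustrative only:)

#include <cstdint>
#include <vector>

// PLAIN encoding for BYTE_ARRAY: 4-byte little-endian length, then the bytes.
// The uint32_t length prefix is what caps a single value at 2^32 - 1 bytes.
void AppendPlainByteArray(std::vector<uint8_t>* page,
                          const uint8_t* data, uint32_t len) {
  uint8_t prefix[4] = {static_cast<uint8_t>(len & 0xFF),
                       static_cast<uint8_t>((len >> 8) & 0xFF),
                       static_cast<uint8_t>((len >> 16) & 0xFF),
                       static_cast<uint8_t>((len >> 24) & 0xFF)};
  page->insert(page->end(), prefix, prefix + 4);
  page->insert(page->end(), data, data + len);
}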

Now that I'm thinking about this a bit more, does it make sense to support
byte arrays that are more than 2GB? That's far larger than the size of a
row group, let alone a page. This would completely break memory management
in the JVM implementation.

Can you solve this problem using a BLOB type that references an external
file with the gigantic values? Seems to me that values this large should go
in separate files, not in a Parquet file where it would destroy any benefit
from using the format.
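
(One rough sketch of what such a reference could look like if it were stored as ordinary columns -- a path plus offset/length -- with the huge value kept in a side file. The struct and field names are hypothetical, not an existing Parquet type:)

#include <cstdint>
#include <string>

// Hypothetical external-BLOB reference. Each field maps onto a plain Parquet
// column (path as BYTE_ARRAY/UTF8, offset and length as INT64); the oversized
// value itself lives in a separate side file, not in the Parquet file.
struct ExternalBlobRef {
  std::string path;  // side file holding the blob
  int64_t offset;    // byte offset of the value within that file
  int64_t length;    // int64_t, so values past 4 GiB are representable
};

A reader that doesn't know about the convention still just sees three ordinary columns, so nothing in the format itself would need to change.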

On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <Br...@sas.com> wrote:

> Hello Ryan,
>
> Looks like it's limited by both the Parquet implementation and the Thrift
> message methods.  Am I missing anything?
>
> From cpp/src/parquet/types.h
>
> struct ByteArray {
>   ByteArray() : len(0), ptr(NULLPTR) {}
>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
>   uint32_t len;
>   const uint8_t* ptr;
> };
>
> From cpp/src/parquet/thrift.h
>
> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T*
> deserialized_msg) {
> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)
>
> -Brian
>
> On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:
>
>     EXTERNAL
>
>     Hi Brian,
>
>     This seems like something we should allow. What imposes the current
> limit?
>     Is it in the thrift format, or just the implementations?
>
>     On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <Br...@sas.com>
> wrote:
>
>     > All,
>     >
>     > SAS requires support for storing varying-length character and binary
> blobs
>     > with a 2^64 max length in Parquet.   Currently, the ByteArray len
> field is
>     > a uint32_t.   Looks like this will require incrementing the Parquet
> file
>     > format version and changing ByteArray len to uint64_t.
>     >
>     > Have there been any requests for this or other Parquet developments
> that
>     > require file format versioning changes?
>     >
>     > I realize this is a non-trivial ask.  Thanks for considering it.
>     >
>     > -Brian
>     >
>
>
>     --
>     Ryan Blue
>     Software Engineer
>     Netflix
>
>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: Need 64-bit Integer length for Parquet ByteArray Type

Posted by Brian Bowman <Br...@sas.com>.
Hello Ryan,

Looks like it's limited by both the Parquet implementation and the Thrift message methods.  Am I missing anything?

From cpp/src/parquet/types.h 

struct ByteArray {
  ByteArray() : len(0), ptr(NULLPTR) {}
  ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint32_t len;
  const uint8_t* ptr;
};

From cpp/src/parquet/thrift.h

inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* deserialized_msg) {
inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out) 
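
(To make the ceiling concrete: a writer handed a 64-bit sized buffer has to range-check it before it can be described by the 32-bit len above -- a minimal sketch, helper name illustrative only, not parquet-cpp code:)

#include <cstdint>
#include <limits>
#include <stdexcept>

// A value sized with a 64-bit length must fit in ByteArray::len (uint32_t)
// before it can be written at all.
inline uint32_t CheckedByteArrayLen(uint64_t len) {
  if (len > std::numeric_limits<uint32_t>::max()) {
    throw std::length_error("value exceeds the 2^32 - 1 byte ByteArray limit");
  }
  return static_cast<uint32_t>(len);
}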

-Brian

On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:

    EXTERNAL
    
    Hi Brian,
    
    This seems like something we should allow. What imposes the current limit?
    Is it in the thrift format, or just the implementations?
    
    On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <Br...@sas.com> wrote:
    
    > All,
    >
    > SAS requires support for storing varying-length character and binary blobs
    > with a 2^64 max length in Parquet.   Currently, the ByteArray len field is
    > a uint32_t.   Looks like this will require incrementing the Parquet file
    > format version and changing ByteArray len to uint64_t.
    >
    > Have there been any requests for this or other Parquet developments that
    > require file format versioning changes?
    >
    > I realize this is a non-trivial ask.  Thanks for considering it.
    >
    > -Brian
    >
    
    
    --
    Ryan Blue
    Software Engineer
    Netflix
    


Re: Need 64-bit Integer length for Parquet ByteArray Type

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi Brian,

This seems like something we should allow. What imposes the current limit?
Is it in the thrift format, or just the implementations?

On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <Br...@sas.com> wrote:

> All,
>
> SAS requires support for storing varying-length character and binary blobs
> with a 2^64 max length in Parquet.   Currently, the ByteArray len field is
> a uint32_t.   Looks like this will require incrementing the Parquet file
> format version and changing ByteArray len to uint64_t.
>
> Have there been any requests for this or other Parquet developments that
> require file format versioning changes?
>
> I realize this is a non-trivial ask.  Thanks for considering it.
>
> -Brian
>


-- 
Ryan Blue
Software Engineer
Netflix
