Posted to dev@parquet.apache.org by Andrei Gudkov <gu...@gmail.com> on 2018/01/03 09:04:58 UTC

[PARQUET-CPP] Writing hierarchical schema to a parquet

We would like to use a combination of Arrow and Parquet to store JSON-like
hierarchical data, but we are having trouble understanding how to serialize
it properly.

Our current workflow (a rough code sketch follows the list):
1. We create a hierarchical arrow::Schema.
2. Then we create a matching arrow::RecordBatchBuilder (via
arrow::RecordBatchBuilder::Make()), which is effectively a hierarchy of
ArrayBuilders of various types.
3. Then we serialize our documents one by one into the RecordBatchBuilder by
walking the document and the ArrayBuilder hierarchy simultaneously.
5. Then we convert the resulting RecordBatch to a Table and try to save it to
a parquet file with parquet::arrow::FileWriter::WriteTable(). (Step #4 is
intentionally missing; see below.)
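
A minimal sketch of steps 1-3 and 5 (Arrow/Parquet C++ APIs as of early 2018;
the field names, output path, and chunk size are illustrative assumptions, and
we use the parquet::arrow::WriteTable() convenience function, which wraps
FileWriter::WriteTable()):

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>
#include <cstdlib>
#include <memory>

// Abort-on-error helper just for this sketch.
#define CHECK_OK(expr)                    \
  do {                                    \
    arrow::Status _st = (expr);           \
    if (!_st.ok()) { std::abort(); }      \
  } while (0)

int main() {
  // 1. Hierarchical schema matching the example below.
  auto schema = arrow::schema({arrow::field(
      "doc",
      arrow::struct_(
          {arrow::field("inner",
                        arrow::struct_(
                            {arrow::field("id", arrow::int64()),
                             arrow::field("tags", arrow::list(arrow::utf8()))})),
           arrow::field("score", arrow::float32())}))});

  // 2. Matching builder hierarchy.
  std::unique_ptr<arrow::RecordBatchBuilder> builder;
  CHECK_OK(arrow::RecordBatchBuilder::Make(schema, arrow::default_memory_pool(),
                                           &builder));

  // 3. Walk each document and the ArrayBuilder hierarchy here, appending
  //    values into builder->GetField(i) and its children (omitted).

  std::shared_ptr<arrow::RecordBatch> batch;
  CHECK_OK(builder->Flush(&batch));

  // 5. Convert to a Table and write it out; this is the call that fails with
  //    "Invalid: Nested column branch had multiple children".
  std::shared_ptr<arrow::Table> table;
  CHECK_OK(arrow::Table::FromRecordBatches({batch}, &table));

  std::shared_ptr<arrow::io::FileOutputStream> sink;
  CHECK_OK(arrow::io::FileOutputStream::Open("/tmp/out.parquet", &sink));

  CHECK_OK(parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                      /*chunk_size=*/1024));
  return 0;
}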
   
At this point serialization fails with the error "Invalid: Nested column
branch had multiple children". We also tried skipping the Table conversion and
saving the root column (a StructArray) directly with
parquet::arrow::FileWriter::WriteColumnChunk(), with the same result.

Looking at the writer.cc code, it seems to expect a flat list of columns. So
there should be a step #4 that converts the hierarchical RecordBatch into a
flat one. For example, a hierarchical schema such as

struct {
  struct {
    int64;
    list {
      string;
    }
  }
  float;
}

should be flattened into a schema consisting of three top-level fields:

struct {
  struct {
    int64;
  }
},
struct {
  struct {
    list {
      string;
    }
  }
},
struct {
  float;
}
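
If such a converter does have to be written by hand (see the question below),
a schema-level sketch might look like the following. FlattenToLeaves is a
hypothetical helper, not an existing Arrow or Parquet API, and it rewrites
only the schema; the corresponding arrays would still need to be split to
match. Lists are treated as leaves, as in the example above.

#include <arrow/api.h>
#include <memory>
#include <vector>

std::vector<std::shared_ptr<arrow::Field>> FlattenToLeaves(
    const std::shared_ptr<arrow::Field>& field) {
  std::vector<std::shared_ptr<arrow::Field>> result;
  if (field->type()->id() == arrow::Type::STRUCT) {
    // One output field per leaf, each re-wrapped in a single-child struct
    // that keeps the original field name and nullability.
    for (int i = 0; i < field->type()->num_children(); ++i) {
      for (const auto& leaf : FlattenToLeaves(field->type()->child(i))) {
        result.push_back(
            arrow::field(field->name(), arrow::struct_({leaf}), field->nullable()));
      }
    }
  } else {
    // Primitive and list types are kept as leaves.
    result.push_back(field);
  }
  return result;
}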

I am curious whether we are going in the right direction. If so, do we need
to write such a converter manually, or is there existing code that does this?

We use the current master (HEAD) of both Arrow and Parquet.




Re: [PARQUET-CPP] Writing hierarchical schema to a parquet

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I recently started working on a new set of value readers in Java that
support hierarchical schemas. I ended up with some code that's a lot easier
to read than the current Java version, and is slightly faster (at least for
my Avro tests). It may be helpful for this work on the C++ side.

Here's the list reader implementation:
https://github.com/Netflix/iceberg/blob/parquet-value-readers/parquet/src/main/java/com/netflix/iceberg/parquet/ParquetValueReaders.java#L172

rb




-- 
Ryan Blue
Software Engineer
Netflix

Re: [PARQUET-CPP] Writing hierarchical schema to a parquet

Posted by Wes McKinney <we...@gmail.com>.
This work would only involve the Arrow interface in src/parquet/arrow
(converting from Arrow representation to repetition/definition level
encoding, and back), so you wouldn't need to master the whole Parquet
codebase, at least. I'd like to help with this work, but realistically
I won't have bandwidth for it until February or more likely March
sometime.

- Wes
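
For reference, the write direction described above (Arrow representation to
repetition/definition levels) reduces, for the simplest case of a nullable
list of nullable primitives under the standard three-level LIST encoding
(max definition level 3, max repetition level 1), to something like this
sketch. ComputeLevels is a hypothetical illustration, not code from
src/parquet/arrow:

#include <arrow/api.h>
#include <cstdint>
#include <vector>

void ComputeLevels(const arrow::ListArray& lists,
                   std::vector<int16_t>* def_levels,
                   std::vector<int16_t>* rep_levels) {
  const arrow::Array& values = *lists.values();
  for (int64_t i = 0; i < lists.length(); ++i) {
    if (lists.IsNull(i)) {                 // null list -> def 0
      def_levels->push_back(0);
      rep_levels->push_back(0);
      continue;
    }
    int32_t start = lists.value_offset(i);
    int32_t end = lists.value_offset(i + 1);
    if (start == end) {                    // empty list -> def 1
      def_levels->push_back(1);
      rep_levels->push_back(0);
      continue;
    }
    for (int32_t j = start; j < end; ++j) {
      // def 2 = null element, def 3 = present element;
      // rep 0 starts a new record, rep 1 continues the current list.
      def_levels->push_back(values.IsNull(j) ? 2 : 3);
      rep_levels->push_back(j == start ? 0 : 1);
    }
  }
}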


Re: [PARQUET-CPP] Writing hierarchical schema to a parquet

Posted by Jim Pivarski <jp...@gmail.com>.
I also have a use-case that requires lists-of-structs and encountered that
limitation in pyarrow. Just one level deep would enable a lot of HEP data.

I've worked out the logic of converting Parquet definition and repetition
levels into Arrow-style arrays:

https://github.com/diana-hep/oamap/blob/master/oamap/source/parquet.py#L604
https://github.com/diana-hep/oamap/blob/master/oamap/source/parquet.py#L238


The conversion is subtle because record nullability and list lengths are
intertwined: repetition levels by themselves cannot encode empty lists, so
empty lists are expressed through an interaction with definition levels. I
also have a suite of artificial samples that test combinations of these
features:

https://github.com/diana-hep/oamap/tree/master/tests/samples
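
To make the interaction concrete, the list case under the standard
three-level LIST encoding (a nullable list of nullable primitives, max
definition level 3, max repetition level 1) decodes roughly as in the
following C++ sketch; LevelsToLists is a hypothetical illustration, not code
from either project:

#include <cstdint>
#include <vector>

struct ListStructure {
  std::vector<int32_t> offsets;  // Arrow list offsets, length = num_lists + 1
  std::vector<bool> list_valid;  // validity of each list slot
  std::vector<bool> item_valid;  // validity of each element slot
};

ListStructure LevelsToLists(const std::vector<int16_t>& def_levels,
                            const std::vector<int16_t>& rep_levels) {
  ListStructure out;
  int32_t item_count = 0;
  for (size_t i = 0; i < def_levels.size(); ++i) {
    if (rep_levels[i] == 0) {
      // rep 0 starts a new record, i.e. a new list slot.
      out.offsets.push_back(item_count);
      out.list_valid.push_back(def_levels[i] >= 1);  // def 0 -> null list
    }
    if (def_levels[i] >= 2) {
      // def >= 2 means the list is non-empty and this entry is an element.
      out.item_valid.push_back(def_levels[i] >= 3);  // def 2 -> null element
      ++item_count;
    }
    // def 1 (empty list) and def 0 (null list) contribute no element,
    // which is exactly why definition levels are needed here.
  }
  out.offsets.push_back(item_count);
  return out;
}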


It's hard for me to imagine diving into a new codebase (Parquet C++) and
adding this feature on my own, but I'd be willing to work with someone who
is familiar with it, knows which regions of the code need to be changed,
and can work in parallel with me remotely. The translation from intertwined
definition and repetition levels to Arrow's separate arrays for each level
of structure was not easy, and I'd like to spread this knowledge now that
my implementation seems to work.

Anyone interested in teaming up?
-- Jim




Re: [PARQUET-CPP] Writing hierarchical schema to a parquet

Posted by Wes McKinney <we...@gmail.com>.
hi Andrei,

We are in need of development assistance in the Parquet C++ project
(https://github.com/apache/parquet-cpp) implementing complete support
for reading and writing nested Arrow data. We only support simple
structs (and structs of structs) and lists (and lists of lists) at the
moment. It's something I'd like to get done in 2018 if no one else
gets there first, but it isn't enough of a priority for me personally
right now to guarantee any kind of timeline.

Thanks
Wes
