Posted to dev@arrow.apache.org by Ishaan Joshi <is...@apache.org> on 2018/02/01 08:46:24 UTC

Writing nested parquet data using pyarrow

Wes and co.,

First off, great project! I was able to read the docs and get going in
under a day; the APIs are super easy to use. That being said, I'm a tad
stuck and, having exhausted my google-fu, am here to ask for assistance.
I want to use pyarrow to write a nested dataset in parquet. The schema is
quite complex, and I'm having difficulty getting going with arrays for
nested data structures. For example, a column in my schema looks like this:

In [7]: schema
Out[7]:
cstruct: struct<field1: double, field2: struct<field1: string>, field3: list<item: int32>, field4: list<struct: struct<field1: int32>>>
  child 0, field1: double
  child 1, field2: struct<field1: string>
      child 0, field1: string
  child 2, field3: list<item: int32>
      child 0, item: int32
  child 3, field4: list<struct: struct<field1: int32>>
      child 0, struct: struct<field1: int32>
          child 0, field1: int32

How would I go about constructing a row with this type? I've been looking
at StructArray and ListArray. I've found the following links during my
research:

* https://github.com/apache/arrow/issues/1217
* https://stackoverflow.com/questions/45341182/nested-data-in-parquet-with-python
* https://github.com/apache/arrow/commit/5c704bce42e3fa71ea4586368962d41173b3e17b

I've managed to wrangle everything but ListArrays, e.g.:

field1_data = pa.array([1.1], type=pa.float64())
field2_data = pa.StructArray.from_arrays(['field1'], [pa.array(['foo'], type=pa.string())])
field3_data = pa.array([[1], [2]], type=pa.list_(pa.int32()))

I'm having trouble with field4:

field4_struct = pa.StructArray.from_arrays(['field1'], [pa.array([1], type=pa.int32())])
field4_data = pa.ListArray.from_arrays(??, field4_struct)

In particular, what does the offset value mean, and how do I populate it?

Thanks in advance for all the help.

-- Ishaan
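
On the offsets question, here is a minimal pure-Python sketch of the idea. It does not use pyarrow, and the helper names to_offsets and from_offsets are made up for illustration. In Arrow's list layout, the child values of all rows are stored in one flat array, and the offsets mark where each row's list begins and ends: row i covers values[offsets[i]:offsets[i + 1]], so n rows need n + 1 offsets, starting at 0.

```python
def to_offsets(rows):
    """Flatten a list of lists into (offsets, values).

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    offsets = [0]
    values = []
    for row in rows:
        values.extend(row)
        # Each offset records where the next row starts in `values`.
        offsets.append(len(values))
    return offsets, values


def from_offsets(offsets, values):
    """Rebuild the rows: row i is values[offsets[i]:offsets[i + 1]]."""
    return [values[start:end] for start, end in zip(offsets, offsets[1:])]


# Three rows of field4-like data; the middle row is an empty list.
rows = [[{"field1": 1}, {"field1": 2}], [], [{"field1": 3}]]
offsets, values = to_offsets(rows)
print(offsets)  # [0, 2, 2, 3]
```

An empty list shows up as two equal consecutive offsets, which is why the offsets array is always one element longer than the number of rows.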

Re: Writing nested parquet data using pyarrow

Posted by Wes McKinney <we...@gmail.com>.
hi Ishaan,

Full support for converting between Arrow's and Parquet's nested data
representation in

https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow

is not yet complete. I have no estimate on when the work will be
completed since I'm not sure who's going to do the work. I will
eventually work on it myself, but I have no urgent need, so if it's
me, it may be sometime later this year. I think it would be a really
interesting project for someone wishing to master both the Arrow
format and C++ API and Parquet nested data encoding.

Separate from that, we could definitely have much better documentation
about different ways to construct nested data in Python.

thanks
Wes
