Posted to dev@arrow.apache.org by Ishaan Joshi <is...@apache.org> on 2018/02/01 08:46:24 UTC
Writing nested parquet data using pyarrow
Wes and co.,
First off, great project! I was able to read the docs and get going in
under a day; the APIs are super easy to use. That being said, I'm a tad
stuck and, having exhausted my google-fu, am here to ask for assistance.
I want to use pyarrow to write a nested dataset in Parquet. The schema is
quite complex, and I'm having difficulty getting going with arrays for
nested data structures. For example, a column in my schema looks like this:
In [7]: schema
Out[7]:
cstruct: struct<field1: double, field2: struct<field1: string>, field3:
list<item: int32>, field4: list<struct: struct<field1: int32>>>
  child 0, field1: double
  child 1, field2: struct<field1: string>
    child 0, field1: string
  child 2, field3: list<item: int32>
    child 0, item: int32
  child 3, field4: list<struct: struct<field1: int32>>
    child 0, struct: struct<field1: int32>
      child 0, field1: int32
How would I go about constructing a row with this type? I've been looking at
StructArray and ListArray. I've found the following links during my
research:
* https://github.com/apache/arrow/issues/1217
* https://stackoverflow.com/questions/45341182/nested-data-in-parquet-with-python
* https://github.com/apache/arrow/commit/5c704bce42e3fa71ea4586368962d41173b3e17b
I've managed to wrangle everything but ListArrays, e.g.:
field1_data = pa.array([1.1], type=pa.float64())
field2_data = pa.StructArray.from_arrays(
    ['field1'], [pa.array(['foo'], type=pa.string())])
field3_data = pa.array([[1], [2]], type=pa.list_(pa.int32()))
I'm having trouble with field4:
field4_struct = pa.StructArray.from_arrays(
    ['field1'], [pa.array([1], type=pa.int32())])
field4_data = pa.ListArray.from_arrays(??, field4_struct)
In particular, what does the offset value mean, and how do I populate it?
Thanks in advance for all the help.
-- Ishaan
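On the offsets question above: in the Arrow columnar format, a list array keeps all child values in one flat array, and the offsets mark where each row's list starts and ends. For n rows there are n + 1 offsets, and row i is the slice values[offsets[i]:offsets[i + 1]]. The sketch below illustrates that layout in plain Python only; it does not call pyarrow, the helper names build_offsets and rows_from_offsets are hypothetical, and null rows are not handled.

```python
from itertools import accumulate


def build_offsets(rows):
    """Flatten a list of lists into (offsets, values).

    offsets has len(rows) + 1 entries, starting at 0; each later entry is
    the running total of row lengths, so row i is
    values[offsets[i]:offsets[i + 1]].
    """
    offsets = [0] + list(accumulate(len(row) for row in rows))
    values = [item for row in rows for item in row]
    return offsets, values


def rows_from_offsets(offsets, values):
    """Rebuild the nested rows by slicing values between adjacent offsets."""
    return [values[start:stop] for start, stop in zip(offsets, offsets[1:])]


# Three rows of a list<struct<field1: int32>> column; the last row is empty.
rows = [[{"field1": 1}], [{"field1": 2}, {"field1": 3}], []]
offsets, values = build_offsets(rows)
print(offsets)  # [0, 1, 3, 3]
print(rows_from_offsets(offsets, values) == rows)  # True
```

Read this way, the first argument to ListArray.from_arrays would be the offsets and the second the flat child array. For the single-element field4_struct in the message, offsets of [0, 1] would describe one row holding that one struct.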
Re: Writing nested parquet data using pyarrow
Posted by Wes McKinney <we...@gmail.com>.
hi Ishaan,
Full support for converting between Arrow's and Parquet's nested data
representation in
https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow
is not yet complete. I have no estimate on when the work will be
completed since I'm not sure who's going to do the work. I will
eventually work on it myself, but I have no urgent need, so if it's
me, it may be sometime later this year. I think it would be a really
interesting project for someone wishing to master both the Arrow
format and C++ API and Parquet nested data encoding.
Separate from that, we could definitely have much better documentation
about different ways to construct nested data in Python.
thanks
Wes