Posted to dev@arrow.apache.org by Adam Lippai <ad...@rigo.sk> on 2020/10/29 14:28:42 UTC

Serializing nested pandas dataframes

Hi,

Is there a way to serialize (IPC) hierarchical tabular data (e.g. the output
of a pandas groupby) in Python?
I tried calling pa.ipc.serialize_pandas() on this example, but it throws an
error:
https://stackoverflow.com/questions/51505504/pandas-nesting-dataframes

Best regards,
Adam Lippai

Re: Serializing nested pandas dataframes

Posted by Adam Lippai <ad...@rigo.sk>.
This is what I want to extend to multiple tables:
https://issues.apache.org/jira/browse/ARROW-10045?focusedCommentId=17207790&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17207790
I would need to come up with a custom binary wrapper for multiple serialized
pyarrow tables, and since Arrow supports hierarchical data to some extent, I
was looking for built-in support for nested tables.
I understand this might not be available at the API level.
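
To make the "custom binary wrapper" idea concrete, here is a minimal sketch of a length-prefixed container for several already-serialized tables. The layout, the MAGIC tag, and the pack_blobs/unpack_blobs names are all hypothetical; nothing here is an Arrow format or a pyarrow API. The payloads are treated as opaque bytes, on the assumption that each one is a separately serialized table.

```python
import io
import struct

MAGIC = b"MTBL"  # hypothetical 4-byte tag, not part of any Arrow format


def pack_blobs(named_blobs):
    """Pack {name: bytes} into one buffer: tag, entry count, then length-prefixed entries."""
    out = io.BytesIO()
    out.write(MAGIC)
    out.write(struct.pack("<I", len(named_blobs)))
    for name, payload in named_blobs.items():
        key = name.encode("utf-8")
        out.write(struct.pack("<I", len(key)))      # name length (uint32, little-endian)
        out.write(key)
        out.write(struct.pack("<Q", len(payload)))  # payload length (uint64, little-endian)
        out.write(payload)
    return out.getvalue()


def unpack_blobs(buffer):
    """Inverse of pack_blobs: return {name: bytes}."""
    view = memoryview(buffer)
    if bytes(view[:4]) != MAGIC:
        raise ValueError("not a packed table container")
    (count,) = struct.unpack_from("<I", view, 4)
    offset = 8
    result = {}
    for _ in range(count):
        (key_len,) = struct.unpack_from("<I", view, offset)
        offset += 4
        name = bytes(view[offset:offset + key_len]).decode("utf-8")
        offset += key_len
        (size,) = struct.unpack_from("<Q", view, offset)
        offset += 8
        result[name] = bytes(view[offset:offset + size])
        offset += size
    return result
```

Each unpacked payload could then be handed to whatever reader matches how it was serialized. The explicit lengths are there so a client can slice out one table without parsing the others.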

Best regards,
Adam Lippai

On Thu, Oct 29, 2020 at 10:14 PM Adam Lippai <ad...@rigo.sk> wrote:

> If I have a DataFrame with columns Date, Category and Value and group by
> Category, I'll have multiple DataFrames with Date and Value columns.
> The result of the groupby is a DataFrameGroupBy, which can't be serialized.
> This is why I tried to assemble a nested DataFrame instead (like the one in
> the SO link from my previous message), but that doesn't work either.
>
> As Apache Arrow JS doesn't support groupby (which would let me process the
> original DF on the client side), I was thinking of pushing the groupby
> operation to the server side (pyarrow): doing the groupby in pandas before
> serializing and sending the result to the client.
> I was wondering whether this (nested Arrow tables) is a supported feature
> or not (e.g. by calling chained table.toArray() or a similar solution).
> Currently I process it in pure JS; it's not that ugly, but not really
> idiomatic either. The lack of a Categorical data type and processing it row
> by row certainly has its performance cost.
>
> Best regards,
> Adam Lippai

Re: Serializing nested pandas dataframes

Posted by Adam Lippai <ad...@rigo.sk>.
If I have a DataFrame with columns Date, Category and Value and group by
Category, I'll have multiple DataFrames with Date and Value columns.
The result of the groupby is a DataFrameGroupBy, which can't be serialized.
This is why I tried to assemble a nested DataFrame instead (like the one in
the SO link from my previous message), but that doesn't work either.

As Apache Arrow JS doesn't support groupby (which would let me process the
original DF on the client side), I was thinking of pushing the groupby
operation to the server side (pyarrow): doing the groupby in pandas before
serializing and sending the result to the client.
I was wondering whether this (nested Arrow tables) is a supported feature
or not (e.g. by calling chained table.toArray() or a similar solution).
Currently I process it in pure JS; it's not that ugly, but not really
idiomatic either. The lack of a Categorical data type and processing it row
by row certainly has its performance cost.

Best regards,
Adam Lippai

On Thu, Oct 29, 2020 at 9:39 PM Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

> Can you give a more specific example of what kind of hierarchical data
> you want to serialize? (E.g. the output of a groupby operation in pandas
> is typically still a DataFrame that can be converted to pyarrow and
> serialized.)
>
> In general, for hierarchical data we have the nested data types (e.g. the
> struct type when you nest "multiple columns in a single column").
>
> Joris

Re: Serializing nested pandas dataframes

Posted by Joris Van den Bossche <jo...@gmail.com>.
Can you give a more specific example of what kind of hierarchical data
you want to serialize? (E.g. the output of a groupby operation in pandas
is typically still a DataFrame that can be converted to pyarrow and
serialized.)

In general, for hierarchical data we have the nested data types (e.g. the
struct type when you nest "multiple columns in a single column").

Joris


On Thu, 29 Oct 2020 at 15:29, Adam Lippai <ad...@rigo.sk> wrote:
>
> Hi,
>
> Is there a way to serialize (IPC) hierarchical tabular data (e.g. the output
> of a pandas groupby) in Python?
> I tried calling pa.ipc.serialize_pandas() on this example, but it throws an
> error:
> https://stackoverflow.com/questions/51505504/pandas-nesting-dataframes
>
> Best regards,
> Adam Lippai