Posted to dev@arrow.apache.org by Mitar <mm...@gmail.com> on 2018/10/19 04:10:30 UTC

Efficient Pandas serialization for mixed object and numeric DataFrames

Hi!

It seems that if a DataFrame contains both numeric and object columns,
the whole DataFrame is pickled, rather than only the object columns
being pickled. Is this right? Are there any plans to improve this?


Mitar

-- 
http://mitar.tnode.com/
https://twitter.com/mitar_m

Re: Efficient Pandas serialization for mixed object and numeric DataFrames

Posted by Wes McKinney <we...@gmail.com>.
hi Mitar -- to Robert's point, we aren't sure which code path you are
referring to.

Perhaps related: I'm interested in handling Python pickling for
"other" kinds of Python objects when converting to or from the Arrow
format. A "Python object" would then be defined as a user-defined type
that's embedded in the Arrow BINARY type. The relevant JIRA for this
is https://issues.apache.org/jira/browse/ARROW-823
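
A minimal sketch of the general idea, using only the standard library:
each arbitrary Python object becomes one opaque byte string, and the
list of byte strings stands in for a binary column. This is an
illustration only, not the design tracked in ARROW-823; the names
`objects` and `binary_column` are hypothetical.

```python
import pickle
from fractions import Fraction

# Arbitrary Python objects that have no natural Arrow type.
objects = [Fraction(1, 3), {"k": [1, 2]}, None]

# Each cell is an opaque byte string; a plain list stands in for a
# binary column here (assumption: no Arrow library is involved).
binary_column = [pickle.dumps(obj) for obj in objects]

# Converting back is the reverse mapping, cell by cell.
round_tripped = [pickle.loads(cell) for cell in binary_column]
```

With pyarrow installed, the same byte strings could presumably be
placed in a real binary array via `pa.array(binary_column,
type=pa.binary())`; the user-defined type metadata discussed above is
not modelled here.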

Thanks
Wes

On Fri, Oct 19, 2018 at 6:26 AM Antoine Pitrou <so...@pitrou.net> wrote:
>
>
> Slightly off-topic, but the recent work on PEP 574 (*) should allow
> efficient serialization of Pandas dataframes (**) with standard pickle
> (or the pickle5 backport).  Experimental support for pickle5 has
> already been merged in Arrow and Numpy (and Pandas uses Numpy as its
> storage backend).  My personal goal is to have the PEP accepted and
> integrated into Python 3.8.
>
> Regards
>
> Antoine.
>
> (*) Pickle protocol 5 with out-of-band data:
> https://www.python.org/dev/peps/pep-0574/
>
> (**) No-copy semantics for pandas dataframes:
> https://github.com/numpy/numpy/pull/12011#issuecomment-428915852
>
>
> On Thu, 18 Oct 2018 21:22:04 -0700
> Robert Nishihara <ro...@gmail.com> wrote:
> > How are you serializing the dataframe? If you use *pyarrow.serialize(df)*,
> > then each column should be serialized separately and numeric columns will
> > be handled efficiently.

Re: Efficient Pandas serialization for mixed object and numeric DataFrames

Posted by Antoine Pitrou <so...@pitrou.net>.
Slightly off-topic, but the recent work on PEP 574 (*) should allow
efficient serialization of pandas DataFrames (**) with standard pickle
(or the pickle5 backport).  Experimental support for pickle5 has
already been merged in Arrow and NumPy (and pandas uses NumPy as its
storage backend).  My personal goal is to have the PEP accepted and
integrated into Python 3.8.

Regards

Antoine.

(*) Pickle protocol 5 with out-of-band data:
https://www.python.org/dev/peps/pep-0574/

(**) No-copy semantics for pandas dataframes:
https://github.com/numpy/numpy/pull/12011#issuecomment-428915852
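
For readers unfamiliar with PEP 574, here is a minimal sketch of
out-of-band pickling using only the standard library (Python 3.8 or
later, where protocol 5 is part of `pickle`). An `array.array` stands
in for a NumPy-backed column, and the `column` dict is a hypothetical
stand-in for a DataFrame; this is not how pandas or Arrow implement it.

```python
import pickle
from array import array

# A large numeric buffer, standing in for one numeric column.
values = array("d", range(100_000))

# Wrapping the buffer in PickleBuffer marks it as eligible for
# out-of-band transfer under protocol 5.
column = {"name": "price", "values": pickle.PickleBuffer(values)}

# With a buffer_callback, the pickler hands the buffer to the callback
# instead of copying its bytes into the pickle stream.
buffers = []
payload = pickle.dumps(column, protocol=5, buffer_callback=buffers.append)

# Without a callback, the same data is copied in-band.
in_band = pickle.dumps(column, protocol=5)

# The consumer supplies the buffers again when loading.
restored = pickle.loads(payload, buffers=buffers)
view = memoryview(restored["values"])
```

The `payload` stream stays small because it only carries the metadata,
while `in_band` contains a full copy of the numeric data. How the
buffers travel between processes (shared memory, a socket, a file) is
left to the application.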


On Thu, 18 Oct 2018 21:22:04 -0700
Robert Nishihara <ro...@gmail.com> wrote:
> How are you serializing the dataframe? If you use *pyarrow.serialize(df)*,
> then each column should be serialized separately and numeric columns will
> be handled efficiently.


Re: Efficient Pandas serialization for mixed object and numeric DataFrames

Posted by Robert Nishihara <ro...@gmail.com>.
How are you serializing the dataframe? If you use *pyarrow.serialize(df)*,
then each column should be serialized separately and numeric columns will
be handled efficiently.
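
To illustrate what "each column serialized separately" can mean, here
is a rough standard-library sketch. It is not pyarrow's implementation:
`serialize_columns` and `deserialize_columns` are hypothetical helpers,
a dict of columns stands in for a DataFrame, and `array.array` stands
in for a numeric column.

```python
import pickle
from array import array


def serialize_columns(columns):
    """Serialize each column on its own.

    Numeric columns are stored as raw bytes plus a type code; every
    other column falls back to pickle.
    """
    out = {}
    for name, values in columns.items():
        if isinstance(values, array):
            out[name] = ("raw", values.typecode, values.tobytes())
        else:
            out[name] = ("pickle", None, pickle.dumps(values))
    return out


def deserialize_columns(payload):
    """Rebuild the columns produced by serialize_columns."""
    columns = {}
    for name, (kind, typecode, data) in payload.items():
        if kind == "raw":
            col = array(typecode)
            col.frombytes(data)
            columns[name] = col
        else:
            columns[name] = pickle.loads(data)
    return columns


frame = {
    "price": array("d", [1.5, 2.0, 3.25]),  # numeric column
    "tags": [{"a"}, None, ("x", 1)],        # object column
}
payload = serialize_columns(frame)
restored = deserialize_columns(payload)
```

Only the `tags` column goes through pickle; the `price` column is
written as its raw bytes, which is the property the original question
asks about.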
