Posted to user@arrow.apache.org by jonathan mercier <jo...@cnrgh.fr> on 2021/02/12 14:21:44 UTC
Can I load only a few columns from a Parquet file?
Dear all,
I have a Parquet file with 300,000 columns and 30,000 rows.
If I load such a file into a pandas DataFrame (with pyarrow), it takes
around 100 GB of RAM.
As I perform pairwise comparisons between columns, I could load the
data N columns at a time.
So, is it possible to load only a few columns from a Parquet file by
their names? That would save some memory.
Thanks
--
Researcher, computational biology
PhD, Jonathan MERCIER
Bioinformatics (LBI)
2, rue Gaston Crémieux
91057 Evry Cedex
Tel: (+33)1 60 87 83 44
Email: jonathan.mercier@cnrgh.fr
Re: Can I load only a few columns from a Parquet file?
Posted by Jacek Pliszka <ja...@gmail.com>.
1. Conversion - when you work with data coming from SQL, it is often
Decimal - pandas handling of decimals is very inefficient - you can
convert them to int/float in Arrow and pass that to pandas - it uses
much less memory.
2. Filters - check the filters option of ParquetDataset or read_parquet -
you can read only the rows you need.
BR,
Jacek
On Fri, 12 Feb 2021 at 15:45, jonathan mercier <jo...@cnrgh.fr>
wrote:
>
> Oh yes, I can do this too.
> Thanks
> Now when I see Parquet I think pyarrow :-)
>
> What did you mean by conversion or filtering?
> Could you provide a little example, please?
>
> Anyway,
>
> Have a good day
>
> On Friday, 12 February 2021 at 15:26 +0100, Jacek Pliszka wrote:
> > Sure - I believe you can do it even in pandas - you have the
> > columns parameter: pd.read_parquet('f.pq', columns=['A', 'B'])
> >
> > Arrow is more useful if you need to do some conversion or filtering.
> >
> > BR,
> >
> > Jacek
> >
> > On Fri, 12 Feb 2021 at 15:21, jonathan mercier <jo...@cnrgh.fr>
> > wrote:
> > >
> > > Dear all,
> > > I have a Parquet file with 300,000 columns and 30,000 rows.
> > > If I load such a file into a pandas DataFrame (with pyarrow), it
> > > takes around 100 GB of RAM.
> > >
> > > As I perform pairwise comparisons between columns, I could load
> > > the data N columns at a time.
> > >
> > > So, is it possible to load only a few columns from a Parquet file
> > > by their names? That would save some memory.
> > >
> > > Thanks
> > >
> > > Thanks
Re: Can I load only a few columns from a Parquet file?
Posted by jonathan mercier <jo...@cnrgh.fr>.
Oh yes, I can do this too.
Thanks
Now when I see Parquet I think pyarrow :-)
What did you mean by conversion or filtering?
Could you provide a little example, please?
Anyway,
Have a good day
On Friday, 12 February 2021 at 15:26 +0100, Jacek Pliszka wrote:
> Sure - I believe you can do it even in pandas - you have the
> columns parameter: pd.read_parquet('f.pq', columns=['A', 'B'])
>
> Arrow is more useful if you need to do some conversion or filtering.
>
> BR,
>
> Jacek
>
> On Fri, 12 Feb 2021 at 15:21, jonathan mercier <jo...@cnrgh.fr>
> wrote:
> >
> > Dear all,
> > I have a Parquet file with 300,000 columns and 30,000 rows.
> > If I load such a file into a pandas DataFrame (with pyarrow), it
> > takes around 100 GB of RAM.
> >
> > As I perform pairwise comparisons between columns, I could load
> > the data N columns at a time.
> >
> > So, is it possible to load only a few columns from a Parquet file
> > by their names? That would save some memory.
> >
> > Thanks
Re: Can I load only a few columns from a Parquet file?
Posted by Jacek Pliszka <ja...@gmail.com>.
Sure - I believe you can do it even in pandas - you have the columns
parameter: pd.read_parquet('f.pq', columns=['A', 'B'])
Arrow is more useful if you need to do some conversion or filtering.
BR,
Jacek
On Fri, 12 Feb 2021 at 15:21, jonathan mercier <jo...@cnrgh.fr>
wrote:
>
> Dear all,
> I have a Parquet file with 300,000 columns and 30,000 rows.
> If I load such a file into a pandas DataFrame (with pyarrow), it
> takes around 100 GB of RAM.
>
> As I perform pairwise comparisons between columns, I could load the
> data N columns at a time.
>
> So, is it possible to load only a few columns from a Parquet file by
> their names? That would save some memory.
>
> Thanks