Posted to dev@arrow.apache.org by Hei Chan <st...@yahoo.com.INVALID> on 2020/04/24 14:31:10 UTC

Strategy for Writing a Large Table?

Hi,
I am new to Arrow and Parquet.
My goal is to decode a 4GB binary file (packed C structs) and write all the records to a file that can be read as an R data frame or a pandas DataFrame, so others can run heavy analysis on the big dataset efficiently (in terms of both loading time and statistical analysis).
I first tried to do something like this in Python:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# for each record after I decode
updates.append(result)  # updates = deque()

# then after reading in all records
pd_updates = pd.DataFrame(updates)  # I think I ran out of memory here; the OOM handler kicked in and killed my process

pd_updates['my_cat_col'] = pd_updates['my_cat_col'].astype('category')
table = pa.Table.from_pandas(pd_updates, preserve_index=False)
pq.write_table(table, 'my.parquet', compression='brotli')

What's the recommended way to handle this kind of big-dataset conversion, and then to load the result from R and Python (pandas)?
Thanks in advance!

Re: Strategy for Writing a Large Table?

Posted by Wes McKinney <we...@gmail.com>.
RecordBatchFileWriter and ParquetWriter create different kinds of
files (Arrow IPC vs. Parquet, respectively). Otherwise, there is
generally not much difference in performance or memory use between
writing a RecordBatch and writing a Table.
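For illustration, a minimal sketch of writing the same data with both writers; the schema, column names, file names, and values below are assumptions for the example, not from the thread:

import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative columns; the real schema would come from the decoded C structs.
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3], type=pa.int64()),
     pa.array([100.5, 100.75, 101.0], type=pa.float64())],
    names=["seq", "price"])

# Arrow IPC file: RecordBatches are written directly.
ipc_writer = pa.RecordBatchFileWriter("updates.arrow", batch.schema)
ipc_writer.write_batch(batch)
ipc_writer.close()

# Parquet file: same data, wrapped in a (single-batch) Table.
pq_writer = pq.ParquetWriter("updates.parquet", batch.schema, compression="brotli")
pq_writer.write_table(pa.Table.from_batches([batch]))
pq_writer.close()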


Re: Strategy for Writing a Large Table?

Posted by Hei Chan <st...@yahoo.com.INVALID>.
I managed to iterate through small chunks of data; for each chunk, I convert it into a pandas.DataFrame, convert that into a Table, and then write it to a Parquet file.
Is there any advantage to writing RecordBatches (through pyarrow.RecordBatchFileWriter.write_batch()?) instead of using ParquetWriter.write_table()?
Thanks.
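A minimal sketch of this chunked approach, for reference; decode_chunks() and the column names are hypothetical stand-ins for the poster's decoder, not real API:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def decode_chunks():
    # Hypothetical stand-in for decoding the packed C structs in chunks.
    yield [{"seq": 1, "price": 100.5}, {"seq": 2, "price": 100.75}]

writer = None
for chunk in decode_chunks():
    df = pd.DataFrame(chunk)  # pandas only ever holds one chunk
    table = pa.Table.from_pandas(df, preserve_index=False)
    if writer is None:
        writer = pq.ParquetWriter("my.parquet", table.schema, compression="brotli")
    writer.write_table(table)  # each call appends a new row group
if writer is not None:
    writer.close()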
 

Re: Strategy for Writing a Large Table?

Posted by Hei Chan <st...@yahoo.com.INVALID>.
Hi Wes,
Thanks for your pointers.
It seems that to skip pandas as an intermediary, I can only construct a pyarrow.RecordBatch from pyarrow.Array or pyarrow.StructArray: https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html
And StructArray.from_pandas()'s description states, "Convert pandas.Series to an Arrow Array".
So are you suggesting that I build a Python list of StructArrays directly in batches, call pyarrow.RecordBatch.from_arrays(), then pyarrow.Table.from_batches() after I have converted all the records from my binary file into RecordBatches, and finally pyarrow.parquet.write_table()?  It seems that holding all the RecordBatches will not fit into my memory, but pyarrow.parquet doesn't seem to allow "appending" Tables.
Is there an easy way to construct a StructArray without constructing pandas.Series objects?
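For what it's worth, a StructArray (or a RecordBatch) can be built from plain Python values without pandas; a minimal sketch, with illustrative field names and values:

import pyarrow as pa

# Option 1: build column arrays directly and zip them into a StructArray.
seqs = pa.array([1, 2, 3], type=pa.int64())
prices = pa.array([100.5, 100.75, 101.0], type=pa.float64())
struct_arr = pa.StructArray.from_arrays([seqs, prices], names=["seq", "price"])

# Option 2: let pa.array() build a struct array from a list of dicts.
struct_arr2 = pa.array(
    [{"seq": 1, "price": 100.5}, {"seq": 2, "price": 100.75}],
    type=pa.struct([("seq", pa.int64()), ("price", pa.float64())]))

# A RecordBatch can also be built from the column arrays directly.
batch = pa.RecordBatch.from_arrays([seqs, prices], names=["seq", "price"])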


Re: Strategy for Writing a Large Table?

Posted by Wes McKinney <we...@gmail.com>.
I recommend going directly via Arrow instead of routing through pandas (or
at least only using pandas as an intermediary to convert smaller chunks to
Arrow). Tables can be composed from smaller RecordBatch objects (see
Table.from_batches), so you don't need to accumulate much non-Arrow data in
memory. You can also zero-copy concatenate tables with concat_tables.
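A minimal sketch of both ideas; the column names and values here are illustrative only:

import pyarrow as pa

names = ["seq", "price"]
batch1 = pa.RecordBatch.from_arrays(
    [pa.array([1, 2]), pa.array([10.0, 10.5])], names=names)
batch2 = pa.RecordBatch.from_arrays(
    [pa.array([3, 4]), pa.array([11.0, 11.5])], names=names)

# Compose a Table from smaller RecordBatches; columns become chunked arrays,
# so no data is copied.
table = pa.Table.from_batches([batch1, batch2])

# Zero-copy concatenation of Tables that share a schema.
combined = pa.concat_tables([table, table])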
