Posted to user@arrow.apache.org by Jeff Wickstrom <jw...@esri.com> on 2021/05/03 18:24:00 UTC

[C++] Writing a large Arrow table to disk

Hello,

I am getting started and prototyping with the Arrow C++ API. Great stuff! I have some newbie questions.

My first goal is to write a bunch of "records" to an ArrowTable on disk. The resulting file can be bigger than RAM so I want to use a zero copy approach when it is subsequently consumed in a Python script, to_pandas(), etc.

My prototype seems to be working. I build the table following the "Row to columnar conversion" example. Then I open a FileOutputStream, get a RecordBatchWriter from arrow::ipc::NewFileWriter(), and call WriteTable() on it to save the table to disk. I get my desired zero copy usage of it in Python.
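
(For concreteness, a minimal sketch of that write path, assuming an already-built std::shared_ptr<arrow::Table> named table; the file name is a placeholder, and newer Arrow releases also provide arrow::ipc::MakeFileWriter as the successor to NewFileWriter.)

    // Sketch of the write path described above: open a file, create an IPC
    // file writer, and write the whole in-memory table in one call.
    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <arrow/ipc/api.h>

    arrow::Status WriteWholeTable(const std::shared_ptr<arrow::Table>& table,
                                  const std::string& path) {
      ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
      ARROW_ASSIGN_OR_RAISE(auto writer,
                            arrow::ipc::NewFileWriter(sink.get(), table->schema()));
      ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
      return writer->Close();
    }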

My question is how do I build that table on disk in a scalable way? Like when there are billions of records? Maybe the ArrayBuilder() classes themselves are already managing the business of spilling over available RAM? Do I need to point up front to a MemoryMappedFile or use a MemoryManager to make this scale? Maybe I am done already, but it seems like I'll need to break up the content added to the ArrayBuilder() classes while also only adding one chunk to the ChunkedArray(s) for zero copy compatibility.

Please help me understand conceptually the best way to write out a bigger than RAM Arrow table. A modified "Row to columnar conversion" example would be great as well if applicable.

Thank you in advance for your help!

Cheers,
Jeff

Re: [C++] Writing a large Arrow table to disk

Posted by Micah Kornfield <em...@gmail.com>.
>
> Please confirm there is no way around this. If you want a table on disk to
> be zero copy compatible, you must build it from contiguous arrays in
> memory. You can’t build an infinite sized memory mapped, zero copy
> compatible file on disk. Right?
>

For the Pandas conversion I believe this is true, but I'm not an expert. If
you can process the data incrementally, then reading a record batch at a
time and converting each batch to pandas should still be zero copy.
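
(As an illustration of the incremental route, a sketch of reading the IPC file back one record batch at a time from a memory map in C++; the analogous loop in pyarrow, converting each batch to pandas, is the part that should stay zero copy. The file name is a placeholder.)

    // Sketch: open the IPC file through a memory map and read one record
    // batch at a time, so each batch references the mapped buffers instead
    // of a full in-memory copy of the table.
    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <arrow/ipc/api.h>

    arrow::Status ProcessBatchByBatch(const std::string& path) {
      ARROW_ASSIGN_OR_RAISE(
          auto mmap, arrow::io::MemoryMappedFile::Open(path, arrow::io::FileMode::READ));
      ARROW_ASSIGN_OR_RAISE(auto reader, arrow::ipc::RecordBatchFileReader::Open(mmap));
      for (int i = 0; i < reader->num_record_batches(); ++i) {
        ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(i));
        // ... process `batch` here (or convert just this batch downstream) ...
      }
      return arrow::Status::OK();
    }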


RE: [C++] Writing a large Arrow table to disk

Posted by Jeff Wickstrom <jw...@esri.com>.
Hi Micah, All,

Thank you for the tip to use WriteRecordBatch. I got that to work. But, probably no surprise, it leads to a table on disk that can’t be loaded zero copy. “pyarrow.lib.ArrowInvalid: Needed to copy 2 chunks with 0 nulls, but zero_copy_only was True”

Please confirm there is no way around this. If you want a table on disk to be zero copy compatible, you must build it from contiguous arrays in memory. You can’t build an infinite sized memory mapped, zero copy compatible file on disk. Right?

Does RecordBatchBuilder help with this? I haven’t figured out how to use that yet.

I am using Arrow 1.0.1. Has anything changed the story on this topic in more recent releases? Seems we need to upgrade anyway.

This is how I am using the RecordBatchWriter so far:

      // Write the current batch: finish each column builder into ArrayData.
      arrow::ArrayDataVector arrayDataVector{};
      for (auto& builder : arrayBuilders)
      {
        std::shared_ptr<arrow::ArrayData> arrayData;
        auto finishStatus = builder->FinishInternal(&arrayData);
        if (!finishStatus.ok())
          return arrow::Status::ExecutionError("Failed to finish builder");
        arrayDataVector.push_back(arrayData);
      }

      // Assemble the batch and append it to the open IPC file.
      auto recordBatch = arrow::RecordBatch::Make(schema, totalRecordsAddedForCurrentBatch, arrayDataVector);
      auto writeRecordBatchStatus = pRecordBatchWriter->WriteRecordBatch(*recordBatch);
      if (!writeRecordBatchStatus.ok())
        return arrow::Status::ExecutionError("Failed to WriteRecordBatch");
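
(For comparison, a sketch of the same step using the builders' public Finish() API, which also resets each builder so it can be reused for the next batch. The builder vector type, schema, row count, and writer are assumed from the snippet above, along with the usual arrow/api.h and arrow/ipc/api.h includes.)

    arrow::Status WriteCurrentBatch(
        std::vector<std::shared_ptr<arrow::ArrayBuilder>>& arrayBuilders,
        const std::shared_ptr<arrow::Schema>& schema,
        int64_t numRows,
        arrow::ipc::RecordBatchWriter& writer)
    {
      // Finish() hands back each column as an Array and resets the builder,
      // so the same builders can accumulate the next batch.
      std::vector<std::shared_ptr<arrow::Array>> columns;
      for (auto& builder : arrayBuilders)
      {
        std::shared_ptr<arrow::Array> column;
        ARROW_RETURN_NOT_OK(builder->Finish(&column));
        columns.push_back(column);
      }
      auto batch = arrow::RecordBatch::Make(schema, numRows, columns);
      return writer.WriteRecordBatch(*batch);
    }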

Thank you in advance for your insights. Confirming how to best approach this is a big help for our design in progress.

Jeff


Re: [C++] Writing a large Arrow table to disk

Posted by Micah Kornfield <em...@gmail.com>.
Hi Jeff,

> Maybe the ArrayBuilder() classes themselves are already managing the
> business of spilling over available RAM?

They do not (unless the underlying allocator already does this). Either
way there would be a big memcpy/flush when writing to disk.


> Do I need to point up front to a MemoryMappedFile or use a MemoryManager to
> make this scale?

Per the above, this could help but also might not.

Maybe I am done already, but it seems like I’ll need to break up the
> content added to the ArrayBuilder() classes while also only adding one
> chunk to the ChunkedArray(s) for zero copy compatibility.


Off the top of my head, the best way of doing this, if possible, is to create
either smaller tables or record batches and write them incrementally (with
WriteTable or WriteRecordBatch [1]).

[1]
https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc17RecordBatchWriter16WriteRecordBatchERK11RecordBatch
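
(A minimal end-to-end sketch of that incremental pattern: only the current batch has to fit in memory, and the builders are reused after each Finish(). The schema, batch size, batch count, and file name below are placeholders. Note that each iteration produces its own record batch in the file, which is why a later zero_copy_only to_pandas() of the whole table sees multiple chunks, as discussed above in the thread.)

    // Sketch: stream record batches into an IPC file so that only one
    // batch's worth of builder data is in RAM at a time.
    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <arrow/ipc/api.h>

    arrow::Status WriteInBatches(const std::string& path) {
      auto schema = arrow::schema({arrow::field("id", arrow::int64()),
                                   arrow::field("value", arrow::float64())});
      ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
      ARROW_ASSIGN_OR_RAISE(auto writer, arrow::ipc::NewFileWriter(sink.get(), schema));

      arrow::Int64Builder idBuilder;
      arrow::DoubleBuilder valueBuilder;
      const int64_t kBatchSize = 1 << 20;   // placeholder batch size
      const int64_t kNumBatches = 10;       // however many batches the data needs

      for (int64_t b = 0; b < kNumBatches; ++b) {
        for (int64_t i = 0; i < kBatchSize; ++i) {
          ARROW_RETURN_NOT_OK(idBuilder.Append(b * kBatchSize + i));
          ARROW_RETURN_NOT_OK(valueBuilder.Append(0.5 * static_cast<double>(i)));
        }
        std::shared_ptr<arrow::Array> ids, values;
        ARROW_RETURN_NOT_OK(idBuilder.Finish(&ids));      // Finish() resets the builder
        ARROW_RETURN_NOT_OK(valueBuilder.Finish(&values));
        auto batch = arrow::RecordBatch::Make(schema, kBatchSize, {ids, values});
        ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
      }
      return writer->Close();
    }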
