Posted to user@arrow.apache.org by Ishan Anand <an...@outlook.com> on 2020/09/09 10:11:54 UTC

[Python/C-Glib] writing IPC file format column-by-column

Hi

I'm looking at using Arrow primarily on low-resource instances with datasets that don't fit in memory. This is the workflow I'm trying to implement:


  *   Write record batches in IPC streaming format to a file from a C runtime.
  *   Consume it one row at a time from Python/C by loading the file in chunks (see the sketch after this list).
  *   If the schema is simple enough to support zero-copy operations, make the table readable from pandas. That requires me to:
     *   convert it into a Table with a single chunk per column (since pandas can't use mmap with chunked arrays), and
     *   write the table in the IPC random access (file) format.
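
To make the consumption step concrete, here is a minimal reading-side sketch; the file name and the row handling below are purely illustrative:

    import pyarrow as pa

    # Hypothetical stream file written by the C runtime in the IPC streaming format.
    with pa.memory_map("data.arrows", "r") as source:
        reader = pa.ipc.open_stream(source)
        for batch in reader:                     # one record batch at a time
            for i in range(batch.num_rows):      # one row at a time
                row = {name: batch.column(j)[i].as_py()
                       for j, name in enumerate(batch.schema.names)}
                # ... process row ...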

PyArrow provides a `combine_chunks` method to merge the chunks of each column into a single chunk. However, it has to materialize the entire table in memory (I suspect the peak is 2x, since both versions of the table are held in memory at once, though that could probably be avoided).
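
Roughly, the path I have in mind looks like the following sketch (file names are illustrative); reading through the memory map is mostly zero-copy, but `combine_chunks` allocates fresh contiguous buffers for every column:

    import pyarrow as pa

    source = pa.memory_map("data.arrows", "r")
    table = pa.ipc.open_stream(source).read_all()    # chunked pyarrow.Table (zero-copy views)

    # Combining allocates new contiguous buffers for every column,
    # which is where the memory cost shows up.
    combined = table.combine_chunks()

    # Write the IPC random access (file) format for later mmap + pandas use.
    with pa.OSFile("data.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, combined.schema) as writer:
            writer.write_table(combined)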

Since the Arrow layout is columnar, I'm curious whether it is possible to write the table one column at a time, and whether the existing GLib/Python APIs support that. The C++ file writer objects only seem to go down to serializing a single record batch at a time, not a single column.
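
As far as I can tell, the finest granularity the current Python API exposes is a whole record batch; the hypothetical snippet below (made-up schema and values) is as low-level as the writer goes, with no per-column write call:

    import pyarrow as pa

    schema = pa.schema([("x", pa.int64()), ("y", pa.float64())])   # made-up schema

    with pa.OSFile("out.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, schema) as writer:
            batch = pa.record_batch([pa.array([1, 2]), pa.array([3.0, 4.0])],
                                    schema=schema)
            writer.write_batch(batch)   # whole batches only; no column-level write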


Thank you,
Ishan

Re: [Python/C-Glib] writing IPC file format column-by-column

Posted by Ishan Anand <an...@outlook.com>.
Hi

Updating the thread for people with a similar use case. A new project called [duckdb](https://github.com/cwida/duckdb) allows memory-mapped Arrow files to be used as virtual tables, so a lot of pandas functionality can be covered by its SQL equivalents. DuckDB works equally well with chunked tables, which removes the need for contiguous columns in the Arrow file.
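
As an illustration (the table name, column names and the exact registration call are just how I'd expect it to look; please check the duckdb docs for the current API):

    import duckdb
    import pyarrow as pa

    # Open the Arrow file via mmap; chunked columns are fine for duckdb.
    source = pa.memory_map("data.arrow", "r")
    table = pa.ipc.open_file(source).read_all()

    con = duckdb.connect()
    con.register("events", table)   # expose the Arrow table to SQL
    print(con.execute("SELECT count(*), avg(y) FROM events").fetchall())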

Thank you,
Ishan
________________________________
From: Sutou Kouhei <ko...@clear-code.com>
Sent: Friday, September 11, 2020 3:23 AM
To: user@arrow.apache.org <us...@arrow.apache.org>; dev@arrow.apache.org <de...@arrow.apache.org>
Subject: Re: [Python/C-Glib] writing IPC file format column-by-column

Hi,

I'm adding dev@ because this may require improvements to Apache Arrow C++.

It seems that we need the following new feature for this
use case (combining chunks within a small memory budget so
that large data can be processed with pandas via mmap):

  * Writing the chunks in an arrow::Table as one large
    arrow::RecordBatch without creating intermediate
    combined chunks

The current arrow::ipc::RecordBatchWriter::WriteTable()
always splits the given arrow::Table into one or more
arrow::RecordBatch objects. We may be able to add a feature
that writes the given arrow::Table as one combined
arrow::RecordBatch without creating the intermediate
combined chunks.


Do C++ developers have any opinion on this?


Thanks,
--
kou

In
 <CH...@CH2PR20MB3095.namprd20.prod.outlook.com>
  "[Python/C-Glib] writing IPC file format column-by-column " on Wed, 9 Sep 2020 10:11:54 +0000,
  Ishan Anand <an...@outlook.com> wrote:

> Hi
>
> I'm looking at using Arrow primarily on low-resource instances with datasets that don't fit in memory. This is the workflow I'm trying to implement:
>
>
>   *   Write record batches in IPC streaming format to a file from a C runtime.
>   *   Consume it one row at a time from Python/C by loading the file in chunks.
>   *   If the schema is simple enough to support zero-copy operations, make the table readable from pandas. That requires me to:
>      *   convert it into a Table with a single chunk per column (since pandas can't use mmap with chunked arrays), and
>      *   write the table in the IPC random access (file) format.
>
> PyArrow provides a `combine_chunks` method to merge the chunks of each column into a single chunk. However, it has to materialize the entire table in memory (I suspect the peak is 2x, since both versions of the table are held in memory at once, though that could probably be avoided).
>
> Since the Arrow layout is columnar, I'm curious whether it is possible to write the table one column at a time, and whether the existing GLib/Python APIs support that. The C++ file writer objects only seem to go down to serializing a single record batch at a time, not a single column.
>
>
> Thank you,
> Ishan

Re: [Python/C-Glib] writing IPC file format column-by-column

Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,

I'm adding dev@ because this may require improvements to Apache Arrow C++.

It seems that we need the following new feature for this
use case (combining chunks within a small memory budget so
that large data can be processed with pandas via mmap):

  * Writing the chunks in an arrow::Table as one large
    arrow::RecordBatch without creating intermediate
    combined chunks

The current arrow::ipc::RecordBatchWriter::WriteTable()
always splits the given arrow::Table into one or more
arrow::RecordBatch objects. We may be able to add a feature
that writes the given arrow::Table as one combined
arrow::RecordBatch without creating the intermediate
combined chunks.
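
To illustrate at the Python level (a sketch of the current behavior, not of any proposed API): getting a single combined record batch into the file today requires combining the chunks in memory first, which is exactly the intermediate copy we would like to avoid.

    import pyarrow as pa

    # A chunked table; the name and values are only for illustration.
    table = pa.table({"x": pa.chunked_array([[1, 2], [3, 4]])})

    with pa.OSFile("out.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            # Today: the chunked table is written as multiple record batches.
            writer.write_table(table)
            # To get one combined record batch, the chunks must first be
            # materialized in memory:
            #   writer.write_table(table.combine_chunks())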


Do C++ developers have any opinion on this?


Thanks,
--
kou

In 
 <CH...@CH2PR20MB3095.namprd20.prod.outlook.com>
  "[Python/C-Glib] writing IPC file format column-by-column " on Wed, 9 Sep 2020 10:11:54 +0000,
  Ishan Anand <an...@outlook.com> wrote:

> Hi
> 
> I'm looking at using Arrow primarily on low-resource instances with datasets that don't fit in memory. This is the workflow I'm trying to implement:
> 
> 
>   *   Write record batches in IPC streaming format to a file from a C runtime.
>   *   Consume it one row at a time from Python/C by loading the file in chunks.
>   *   If the schema is simple enough to support zero-copy operations, make the table readable from pandas. That requires me to:
>      *   convert it into a Table with a single chunk per column (since pandas can't use mmap with chunked arrays), and
>      *   write the table in the IPC random access (file) format.
> 
> PyArrow provides a `combine_chunks` method to merge the chunks of each column into a single chunk. However, it has to materialize the entire table in memory (I suspect the peak is 2x, since both versions of the table are held in memory at once, though that could probably be avoided).
> 
> Since the Arrow layout is columnar, I'm curious whether it is possible to write the table one column at a time, and whether the existing GLib/Python APIs support that. The C++ file writer objects only seem to go down to serializing a single record batch at a time, not a single column.
> 
> 
> Thank you,
> Ishan