Posted to user@arrow.apache.org by Ishan Anand <an...@outlook.com> on 2020/09/06 07:40:06 UTC

[C-GLib] reading values quickly from a list array

Hi

I am trying to use the Arrow GLib API to read/write from C. Specifically, while Arrow is a columnar format, I'm really excited to be able to write a lot of rows from a C-like runtime and access them from Python for analytics as an array per column. And vice versa.

To get a quick example running, I created an Arrow file in Python with 100 million entries (100 record batches of one million rows each) as follows:
```py
import numpy as np
import pyarrow as pa

foo = {
    "colA": np.arange(0, 1000_000),
    "colB": [np.arange(1, 5)] * 1000_000
}

table = pa.table(foo)
with pa.RecordBatchFileWriter("/tmp/batch.arrow", table.schema) as writer:
    for _ in range(100):
        writer.write_table(table)
```

However, reading the ListArray column with the GLib API is really slow: it takes about 5 seconds per record batch of a million entries, while the integer column can be iterated over for the entire table in under 2 seconds.

The relevant snippet is this:
```C
    guint i;
    guint num_batches = 100;
    for (i = 0; i < num_batches; i++) {
        GArrowRecordBatch *record_batch;
        record_batch = garrow_record_batch_file_reader_read_record_batch(reader, i, &error);

        GArrowArray* column = garrow_record_batch_get_column_data(record_batch, 1);
        gint64 length_list = garrow_array_get_length(column);
        GArrowListArray* list_arr = (GArrowListArray*)column;

        gint64 j;
        GArrowArray* list_elem;
        for (j = 0; j < length_list; j++) {
            list_elem = garrow_list_array_get_value(list_arr, j);
            /* Release the per-row array returned by get_value. */
            g_object_unref(list_elem);
        }
        g_object_unref(column);
        g_object_unref(record_batch);
    }
```

I can't seem to find a quicker alternative in the public GLib API to read data out of a list array. Is there a way to speed up this loop?


Thank you,
Ishan




Re: [C-GLib] reading values quickly from a list array

Posted by Ishan Anand <an...@outlook.com>.
I'll make sure to do that. Thank you again.

Best,
Ishan

Re: [C-GLib] reading values quickly from a list array

Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,

I've merged it.

Note that you need to install Apache Arrow C++ (master) before you
install Apache Arrow GLib (master). Apache Arrow GLib
depends on Apache Arrow C++.

Thanks,
--
kou



Re: [C-GLib] reading values quickly from a list array

Posted by Ishan Anand <an...@outlook.com>.
Thank you very much for the commit, Kouhei-san. I'd love to use it sooner, so I'll build Arrow GLib directly from source once this PR is in.


Thank you,
Ishan

Re: [C-GLib] reading values quickly from a list array

Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,

garrow_list_array_get_value() is a somewhat costly function
because it creates a sub list array. It doesn't copy the array
data (the data is shared), but it creates a new sub array
(a container for the data) at both the C++ level and the C level.

Apache Arrow GLib 1.0.1 doesn't have low-level APIs to access
list array values. Sorry. I've implemented them:
https://github.com/apache/arrow/pull/8119

They'll be included in Apache Arrow GLib 2.0.0, which will be
released in a few months.

(Can you wait for 2.0.0?)

With these APIs, you can write code like the following:

----
#include <stdlib.h>
#include <arrow-glib/arrow-glib.h>

int
main(void)
{
  GError *error = NULL;

  GArrowMemoryMappedInputStream *input;
  input = garrow_memory_mapped_input_stream_new("/tmp/batch.arrow", &error);
  if (!input) {
    g_print("failed to open file: %s\n", error->message);
    g_error_free(error);
    return EXIT_FAILURE;
  }

  {
    GArrowRecordBatchFileReader *reader;
    reader =
      garrow_record_batch_file_reader_new(GARROW_SEEKABLE_INPUT_STREAM(input),
                                          &error);

    if (!reader) {
      g_print("failed to open file reader: %s\n", error->message);
      g_error_free(error);
      g_object_unref(input);
      return EXIT_FAILURE;
    }

    {
      guint i;
      guint num_batches = 100;
      for (i = 0; i < num_batches; i++) {
        GArrowRecordBatch *record_batch;
        record_batch = garrow_record_batch_file_reader_read_record_batch(reader, i, &error);
        if (!record_batch) {
          g_print("failed to read record batch: %s\n", error->message);
          g_error_free(error);
          g_object_unref(reader);
          g_object_unref(input);
          return EXIT_FAILURE;
        }

        /* Column 1 is "colB", the list column. */
        GArrowArray *column = garrow_record_batch_get_column_data(record_batch, 1);
        gint64 length_list = garrow_array_get_length(column);

        GArrowListArray *list_arr = GARROW_LIST_ARRAY(column);

        /* Flat array holding the elements of all lists back to back. */
        GArrowInt64Array *list_values =
          GARROW_INT64_ARRAY(garrow_list_array_get_values(list_arr));
        gint64 n_list_values;
        const gint64 *raw_list_values =
          garrow_int64_array_get_values(list_values, &n_list_values);
        /* List j spans [value_offsets[j], value_offsets[j + 1]). */
        gint64 n_value_offsets;
        const gint32 *value_offsets =
          garrow_list_array_get_value_offsets(list_arr, &n_value_offsets);
        gint64 j;
        /* Iterate over the lists, not the offsets, so that
         * value_offsets[j + 1] stays in bounds. */
        for (j = 0; j < length_list; ++j) {
          gint32 value_offset = value_offsets[j];
          gint32 value_length = value_offsets[j + 1] - value_offset;
          gint32 k;
          for (k = 0; k < value_length; ++k) {
            /* Read one element; replace with real processing. */
            (void)raw_list_values[value_offset + k];
          }
        }
        g_object_unref(list_values);

        g_object_unref(column);

        g_object_unref(record_batch);
      }
    }
    g_object_unref(reader);
  }

  g_object_unref(input);

  return EXIT_SUCCESS;
}
----

It takes 0.5sec on my machine.


Thanks,
--
kou
