Posted to user@arrow.apache.org by Ishan Anand <an...@outlook.com> on 2020/09/06 07:40:06 UTC
[C-GLib] reading values quickly from a list array
Hi
I am trying to use the Arrow GLib API to read/write from C. Specifically, although Arrow is a columnar format, I'm really excited to be able to write a lot of rows from a C-like runtime and access them from Python for analytics as one array per column, and vice versa.
To get a quick example running, I created an Arrow file in Python with 100 million entries (a table of 1 million rows written 100 times) as follows:
```py
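import numpy as np
import pyarrow as pa

# One table of 1,000,000 rows: an integer column and a list-of-integers column.
foo = {
    "colA": np.arange(0, 1000_000),
    "colB": [np.arange(1, 5)] * 1000_000
}
table = pa.table(foo)
# Write the same table 100 times.
with pa.RecordBatchFileWriter("/tmp/batch.arrow", table.schema) as writer:
    for _ in range(100):
        writer.write_table(table)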
```
However, reading the ListArray column through the GLib API is very slow: about 5 seconds per record batch of one million entries, whereas the integer column can be iterated over the entire table in under 2 seconds.
The relevant snippet is this:
```C
guint i;
guint num_batches = 100;
for (i = 0; i < num_batches; i++) {
  GArrowRecordBatch *record_batch;
  record_batch = garrow_record_batch_file_reader_read_record_batch(reader, i, &error);
  GArrowArray *column = garrow_record_batch_get_column_data(record_batch, 1);
  gint64 length_list = garrow_array_get_length(column);
  GArrowListArray *list_arr = GARROW_LIST_ARRAY(column);
  gint64 j;
  GArrowArray *list_elem;
  for (j = 0; j < length_list; j++) {
    /* Each call returns a new array object that the caller owns. */
    list_elem = garrow_list_array_get_value(list_arr, j);
    g_object_unref(list_elem);
  }
  g_object_unref(column);
  g_object_unref(record_batch);
}
```
I can't seem to find a quicker alternative in the public GLib API to read data out of a list array. Is there a way to speed up this loop?
Thank you,
Ishan
Re: [C-GLib] reading values quickly from a list array
Posted by Ishan Anand <an...@outlook.com>.
I'll make sure to do that. Thank you again.
Best,
Ishan
________________________________
From: Sutou Kouhei <ko...@clear-code.com>
Sent: Tuesday, September 8, 2020 2:56 AM
To: user@arrow.apache.org <us...@arrow.apache.org>
Subject: Re: [C-GLib] reading values quickly from a list array
Hi,
I've merged it.
Note that you need to install Apache Arrow C++ (master) before you
install Apache Arrow GLib (master). Apache Arrow GLib
depends on Apache Arrow C++.
Thanks,
--
kou
Re: [C-GLib] reading values quickly from a list array
Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,
I've merged it.
Note that you need to install Apache Arrow C++ (master) before you
install Apache Arrow GLib (master). Apache Arrow GLib
depends on Apache Arrow C++.
Thanks,
--
kou
In
<CH...@CH2PR20MB3095.namprd20.prod.outlook.com>
"Re: [C-GLib] reading values quickly from a list array " on Mon, 7 Sep 2020 04:54:24 +0000,
Ishan Anand <an...@outlook.com> wrote:
> Thank you very much for the commit, Kouhei-san. I'd love to use it sooner, so I'll build Arrow GLib directly from source once this PR is in.
>
>
> Thank you,
> Ishan
Re: [C-GLib] reading values quickly from a list array
Posted by Ishan Anand <an...@outlook.com>.
Thank you very much for the commit, Kouhei-san. I'd love to use it sooner, so I'll build Arrow GLib directly from source once this PR is in.
Thank you,
Ishan
________________________________
From: Sutou Kouhei <ko...@clear-code.com>
Sent: Monday, September 7, 2020 6:44 AM
To: user@arrow.apache.org <us...@arrow.apache.org>
Subject: Re: [C-GLib] reading values quickly from a list array
Hi,
garrow_list_array_get_value() is a somewhat expensive function
because it creates a sub list array. It doesn't copy the array
data (it shares it) but it creates a new sub array
(a container for the data) at both the C++ level and the C level.
Apache Arrow GLib 1.0.1 doesn't have low-level APIs to access
list array values. Sorry. I've implemented them:
https://github.com/apache/arrow/pull/8119
They'll be included in Apache Arrow GLib 2.0.0, which will be
released in a few months.
(Can you wait for 2.0.0?)
Re: [C-GLib] reading values quickly from a list array
Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,
garrow_list_array_get_value() is a somewhat expensive function
because it creates a sub list array. It doesn't copy the array
data (it shares it) but it creates a new sub array
(a container for the data) at both the C++ level and the C level.
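
As a rough analogy in plain C (this is not Arrow GLib's actual implementation; `SliceView`, `slice_view_new()` and `slice_view_free()` are hypothetical names made up for this sketch), a per-element view can share the underlying values and still cost one allocation and one free per element:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a "sub array" object: it owns no values,
 * it only points into a buffer owned by someone else. */
typedef struct {
  const int64_t *data; /* shared with the parent, not copied */
  int32_t length;
} SliceView;

/* Create a view of `length` values starting at `offset`.
 * No values are copied, but every call still pays for one allocation. */
static SliceView *
slice_view_new(const int64_t *values, int32_t offset, int32_t length)
{
  SliceView *view = malloc(sizeof(*view));
  assert(view != NULL);
  view->data = values + offset;
  view->length = length;
  return view;
}

/* Release the view itself; the shared values are left untouched. */
static void
slice_view_free(SliceView *view)
{
  free(view);
}
```

Calling something like `slice_view_new()` once per list element means 100 million small allocations for the file in this thread, which is the kind of overhead that direct access to the raw values avoids.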
Apache Arrow GLib 1.0.1 doesn't have low-level APIs to access
list array values. Sorry. I've implemented them:
https://github.com/apache/arrow/pull/8119
They'll be included in Apache Arrow GLib 2.0.0, which will be
released in a few months.
(Can you wait for 2.0.0?)
With these APIs, you can write code like the following:
----
#include <stdlib.h>
#include <arrow-glib/arrow-glib.h>

int
main(void)
{
  GError *error = NULL;
  int status = EXIT_SUCCESS;

  GArrowMemoryMappedInputStream *input;
  input = garrow_memory_mapped_input_stream_new("/tmp/batch.arrow", &error);
  if (!input) {
    g_print("failed to open file: %s\n", error->message);
    g_error_free(error);
    return EXIT_FAILURE;
  }

  {
    GArrowRecordBatchFileReader *reader;
    reader =
      garrow_record_batch_file_reader_new(GARROW_SEEKABLE_INPUT_STREAM(input),
                                          &error);
    if (!reader) {
      g_print("failed to open file reader: %s\n", error->message);
      g_error_free(error);
      g_object_unref(input);
      return EXIT_FAILURE;
    }

    {
      guint i;
      guint num_batches = 100;
      for (i = 0; i < num_batches; i++) {
        GArrowRecordBatch *record_batch;
        record_batch =
          garrow_record_batch_file_reader_read_record_batch(reader, i, &error);
        if (!record_batch) {
          g_print("failed to read record batch: %s\n", error->message);
          g_clear_error(&error);
          status = EXIT_FAILURE;
          break;
        }

        GArrowArray *column = garrow_record_batch_get_column_data(record_batch, 1);
        gint64 length_list = garrow_array_get_length(column);
        GArrowListArray *list_arr = GARROW_LIST_ARRAY(column);

        /* All list elements are stored back to back in one flat array. */
        GArrowInt64Array *list_values =
          GARROW_INT64_ARRAY(garrow_list_array_get_values(list_arr));
        gint64 n_list_values;
        const gint64 *raw_list_values =
          garrow_int64_array_get_values(list_values, &n_list_values);
        /* List j spans [value_offsets[j], value_offsets[j + 1]). */
        gint64 n_value_offsets;
        const gint32 *value_offsets =
          garrow_list_array_get_value_offsets(list_arr, &n_value_offsets);
        gint64 j;
        /* Loop over the lists rather than the offsets so that
           value_offsets[j + 1] stays in bounds. */
        for (j = 0; j < length_list; ++j) {
          gint32 value_offset = value_offsets[j];
          gint32 value_length = value_offsets[j + 1] - value_offset;
          gint32 k;
          for (k = 0; k < value_length; ++k) {
            /* Placeholder: use the value here. */
            (void)raw_list_values[value_offset + k];
          }
        }
        g_object_unref(list_values);
        g_object_unref(column);
        g_object_unref(record_batch);
      }
    }
    g_object_unref(reader);
  }

  g_object_unref(input);

  return status;
}
----
It takes 0.5 sec on my machine.
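
The loop above relies on how a list array is laid out: one flat values array plus an offsets array with one more entry than there are lists. Here is a minimal sketch of that indexing in plain C, independent of Arrow GLib. The helper names `sum_list_element()` and `sum_all_lists()` are made up for illustration and are not part of any Arrow API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum the values of list element `i`.
 * Element i spans values[offsets[i]] .. values[offsets[i + 1] - 1],
 * so `offsets` must hold at least i + 2 entries. */
static int64_t
sum_list_element(const int32_t *offsets, const int64_t *values, size_t i)
{
  int64_t sum = 0;
  int32_t k;
  assert(offsets[i] <= offsets[i + 1]);
  for (k = offsets[i]; k < offsets[i + 1]; ++k) {
    sum += values[k];
  }
  return sum;
}

/* Sum the values of every list.
 * `n_offsets` entries describe n_offsets - 1 lists, so the loop stops
 * one short of the last offset. */
static int64_t
sum_all_lists(const int32_t *offsets, size_t n_offsets, const int64_t *values)
{
  int64_t total = 0;
  size_t i;
  for (i = 0; i + 1 < n_offsets; ++i) {
    total += sum_list_element(offsets, values, i);
  }
  return total;
}
```

For the `colB` column in this thread, every list is `[1, 2, 3, 4]`, so the offsets would be 0, 4, 8, and so on, and each list sums to 10.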
Thanks,
--
kou