Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2022/09/04 09:22:23 UTC
[GitHub] [arrow-julia] bilelomrani1 opened a new issue, #334: Streaming: Pyarrow is 15 times faster than Arrow.jl
bilelomrani1 opened a new issue, #334:
URL: https://github.com/apache/arrow-julia/issues/334
I have an `.arrow` file generated with `pyarrow` whose schema is the following:
```
input: struct<open: fixed_size_list<item: float>[512], high: fixed_size_list<item: float>[512], low: fixed_size_list<item: float>[512], close: fixed_size_list<item: float>[512]> not null
  child 0, open: fixed_size_list<item: float>[512]
      child 0, item: float
  child 1, high: fixed_size_list<item: float>[512]
      child 0, item: float
  child 2, low: fixed_size_list<item: float>[512]
      child 0, item: float
  child 3, close: fixed_size_list<item: float>[512]
      child 0, item: float
```
With `pyarrow`, I load and iterate over records with the following:
```python
import pyarrow as pa

with pa.memory_map('arraydata.arrow', 'r') as source:
    loaded_arrays = pa.ipc.open_file(source).read_all()
    a = 0
    for batch in loaded_arrays.to_batches():
        for input_candles in batch["input"]:
            a += 1
```
Iterating over my example file (~10,000 rows) takes 210 ms.
In julia, I load and iterate over the same file with the following:
```julia
using Arrow
using BenchmarkTools

stream = Arrow.Stream("./arraydata.arrow")

function bench_iteration(stream)
    a = 0
    for batch in stream
        for sample in batch.input
            a += 1
        end
    end
end

@btime bench_iteration($stream)
```
```
3.169 s (25272097 allocations: 1.70 GiB)
```
Iterating over records takes 15 times longer with `Arrow.jl`. Am I doing something wrong?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org