Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2022/09/04 09:22:23 UTC
[GitHub] [arrow-julia] bilelomrani1 opened a new issue, #334: Streaming: Pyarrow is 15 times faster than Arrow.jl
bilelomrani1 opened a new issue, #334:
URL: https://github.com/apache/arrow-julia/issues/334
I have an `.arrow` file generated with `pyarrow` whose schema is the following:
```
input: struct<open: fixed_size_list<item: float>[512], high: fixed_size_list<item: float>[512], low: fixed_size_list<item: float>[512], close: fixed_size_list<item: float>[512]> not null
  child 0, open: fixed_size_list<item: float>[512]
      child 0, item: float
  child 1, high: fixed_size_list<item: float>[512]
      child 0, item: float
  child 2, low: fixed_size_list<item: float>[512]
      child 0, item: float
  child 3, close: fixed_size_list<item: float>[512]
      child 0, item: float
```
With `pyarrow`, I load and iterate over records with the following:
```python
import pyarrow as pa

with pa.memory_map('arraydata.arrow', 'r') as source:
    loaded_arrays = pa.ipc.open_file(source).read_all()
    a = 0
    for batch in loaded_arrays.to_batches():
        for input_candles in batch["input"]:
            a += 1
```
Iterating over my example file (~10,000 rows) takes 210 ms.
In julia, I load and iterate over the same file with the following:
```julia
using Arrow
using BenchmarkTools

stream = Arrow.Stream("./arraydata.arrow")

function bench_iteration(stream)
    a = 0
    for batch in stream
        for sample in batch.input
            a += 1
        end
    end
end

@btime bench_iteration($stream)
```
```
3.169 s (25272097 allocations: 1.70 GiB)
```
Iterating over records takes 15 times longer with `Arrow.jl`. Am I doing something wrong?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org