Posted to github@arrow.apache.org by "Liamtoha (via GitHub)" <gi...@apache.org> on 2023/04/28 22:33:44 UTC

[GitHub] [arrow] Liamtoha commented on issue #35268: [C++] OrderBy with spillover

Liamtoha commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1528160830

   > ### Describe the enhancement requested
   > Hi everyone,
   > 
   > I'm representing a group of researchers working with observational health data. We have an [ecosystem of packages mostly in R](https://github.com/OHDSI/) and have been exploring using `arrow` as a backend for working with our data instead of `sqlite`. We've been impressed by the speed improvements and were almost ready to make the switch, but we've hit a roadblock.
   > 
   > The current sorting in arrow (using `dplyr::arrange`) is taking too much memory. Looking into it further, I see that this [operation](https://arrow.apache.org/docs/dev/cpp/streaming_execution.html#order-by-sink) is a `pipeline breaker` and appears to accumulate everything in memory before sorting on a single thread.
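   > 
   > For concreteness, a minimal sketch of the kind of pipeline we mean; the paths and sort keys (`person_id`, `observation_date`) are just placeholders:
   > 
   > ```r
   > library(arrow)
   > library(dplyr)
   > 
   > # Open a larger-than-memory Parquet dataset lazily
   > ds <- open_dataset("path/to/parquet_dataset")
   > 
   > # arrange() is the step that goes through the order-by sink: the whole
   > # sorted input appears to be held in memory before batches are written out
   > ds %>%
   >   arrange(person_id, observation_date) %>%
   >   write_dataset("path/to/sorted_dataset", format = "parquet")
   > ```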
   > 
   > I also see mentioned in many places the plan is to improve this and add spillover mechanisms to the sort and other `pipeline breakers`.
   > 
   > I did a small comparison between `arrow`, our current solution, `duckdb` and `dplyr`. I measured time and peak memory with GNU `time`; a sketch of the measurement setup follows the table below.
   > 
   > | | | Small | Medium |
   > | --- | --- | --- | --- |
   > | memory after `dplyr::compute()` | | 1.1 GB | 5.1 GB |
   > | arrow (arrange and then write_dataset) | memory | 3.1 GB | 14.1 GB |
   > | | time | 1 minute 12 sec | 8 minutes 46 sec |
   > | dplyr (collect and then arrange) | memory | 3.6 GB | 15.9 GB |
   > | | time | 11 seconds | 1 minute |
   > | duckdb (from parquet files) | memory | 4.3 GB | 19.3 GB |
   > | | time | 4 seconds | 21 seconds |
   > | Our current solution (uses sqlite) | memory | 240 MB | 260 MB |
   > | | time | 2 minutes 30 seconds | 13 min 22 sec |
   > As you can see, our current solution is slow, but it will never run out of memory.
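   > 
   > For reference, a sketch of the kind of harness used: each variant runs as a small standalone script under GNU `time`, and wall-clock time plus peak RSS are read from its report (the exact scripts differed). The example below is the `dplyr` "collect and then arrange" variant, with placeholder file names, paths and sort keys:
   > 
   > ```r
   > # benchmark_dplyr.R -- run under GNU time, e.g.:
   > #   /usr/bin/time -v Rscript benchmark_dplyr.R
   > # then read wall-clock time and "Maximum resident set size" from the report
   > library(arrow)
   > library(dplyr)
   > 
   > open_dataset("path/to/parquet_dataset") %>%  # lazy scan of the Parquet files
   >   collect() %>%                              # materialise everything in R
   >   arrange(person_id, observation_date)       # in-memory sort (placeholder keys)
   > ```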
   > 
   > It would be very nice if spillover were added to the sort in arrow, so that we could specify a memory limit, avoid running out of memory, and sort larger-than-memory data. I hope you would consider this feature for the near future (even for arrow `13.0.0`).
   > 
   > I just wanted to open this issue to make you aware that this is a blocker for us at the moment. We don't have the C++ knowledge to contribute a solution ourselves, but we would be glad to help if changes to the R bindings were needed, and of course with testing.
   > 
   > ### Component(s)
   > C++
   
   Commit

