Posted to issues@arrow.apache.org by "egillax (via GitHub)" <gi...@apache.org> on 2023/04/21 12:39:05 UTC

[GitHub] [arrow] egillax opened a new issue, #35268: OrderBy with spillover

egillax opened a new issue, #35268:
URL: https://github.com/apache/arrow/issues/35268

   ### Describe the enhancement requested
   
   Hi everyone,
   
   I'm representing a group of researchers working with observational health data. We have an [ecosystem of packages, mostly in R](https://github.com/OHDSI/), and have been exploring using `arrow` as a backend for working with our data instead of `sqlite`. We've been impressed by the speed improvements and were almost ready to make the switch, but we've hit a roadblock.
   
   The current sorting in arrow (using `dplyr::arrange`) is taking too much memory. Looking into it further, I see this [operation](https://arrow.apache.org/docs/dev/cpp/streaming_execution.html#order-by-sink) is a `pipeline breaker` and appears to accumulate everything in memory before sorting with a single thread.
   
   I have also seen it mentioned in several places that the plan is to improve this and add spillover mechanisms to the sort and other `pipeline breakers`.
   
   I did a small comparison between `arrow`, our current solution, `duckdb` and `dplyr`. I measured time and peak memory with `gnu time`.
   
   |                                        |                               | Small                | Medium           |
   | -------------------------------------- | ----------------------------- | -------------------- | ---------------- |
   |                                        | memory after dplyr::compute() | 1.1 GB               | 5.1 GB           |
   | arrow (arrange and then write_dataset) | memory                        | 3.1 GB               | 14.1 GB          |
   |                                        | time                          | 1 minute 12 sec      | 8 minutes 46 sec |
   | dplyr (collect and then arrange)       | memory                        | 3.6 GB               | 15.9 GB          |
   |                                        | time                          | 11 seconds           | 1 minute         |
   | duckdb (from parquet files)            | memory                        | 4.3 GB               | 19.3 GB          |
   |                                        | time                          | 4 seconds            | 21 seconds       |
   | Our current solution (uses sqlite)     | memory                        | 240 MB               | 260 MB           |
   |                                        | time                          | 2 minutes 30 seconds | 13 min 22 sec    |
   
   As you can see, our current solution is slow but will never run out of memory.
   
   It would be very nice if spillover were added to the sort in arrow, so we could specify a memory limit to make sure we don't run out of memory and can sort larger-than-memory data. We hope you might consider this feature in the near future (even for arrow `13.0.0`).
   
   I just wanted to open this issue to make you aware that this is a blocker for us at the moment. We don't have the C++ knowledge to contribute a solution ourselves, but we would be glad to help if changes to the R bindings are needed, and of course with testing.
   
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #35268: [C++] OrderBy with spillover

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1517853757

   Thanks for the detailed writeup and investigation.  I agree this is a top priority for Acero, although I have unfortunately been very busy these past few months and I don't know if that will change, so I can't really promise anything.
   
   There are a few approaches that can be taken here, and https://en.wikipedia.org/wiki/External_sorting provides a good summary / starting point for an investigation.  Some of these approaches require a merge step, so some kind of `merge_indices` / `merge` function (e.g. https://github.com/apache/arrow/issues/33512) would be essential.
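   To make the merge step concrete, here is a minimal standalone sketch of the merge phase of external merge sort: a k-way merge of individually sorted runs using a min-heap, touching only one element per run at a time. This is plain C++ over `std::vector`s standing in for on-disk runs, not Arrow API.

   ```cpp
   #include <cassert>
   #include <functional>
   #include <queue>
   #include <tuple>
   #include <vector>

   // Merge any number of individually sorted runs into one sorted output.
   // Each heap entry is (value, run index, position within run), ordered by value,
   // so memory use is proportional to the number of runs, not the total data size.
   std::vector<int> MergeRuns(const std::vector<std::vector<int>>& runs) {
     using Entry = std::tuple<int, size_t, size_t>;
     std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
     for (size_t r = 0; r < runs.size(); ++r) {
       if (!runs[r].empty()) heap.emplace(runs[r][0], r, 0);
     }
     std::vector<int> out;
     while (!heap.empty()) {
       auto [value, run, pos] = heap.top();
       heap.pop();
       out.push_back(value);
       // Refill from the run that just produced the smallest element.
       if (pos + 1 < runs[run].size()) heap.emplace(runs[run][pos + 1], run, pos + 1);
     }
     return out;
   }
   ```

   In a real spilling implementation each "run" would be streamed from a spill file batch by batch rather than held in memory.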




[GitHub] [arrow] R-JunmingChen commented on issue #35268: [C++] OrderBy with spillover

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1565526594

   Hi, @westonpace, just to confirm my understanding of the async implementation in arrow.
   For this issue, we have an `async reader` in arrow, but it seems there is no async implementation of the `writer`/`WriteTable`. Is it OK if I use C++ async to implement the writing process? When async writing is supported by arrow, we can shift to it.




[GitHub] [arrow] R-JunmingChen commented on issue #35268: [C++] OrderBy with spillover

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1629193926

   It seems that I can use a Scalar to obtain a single element and compare it with the Function you mentioned, so the problem can be solved at low engineering cost.




[GitHub] [arrow] westonpace commented on issue #35268: [C++] OrderBy with spillover

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1585258861

   Are you trying to read a single row?  Or a whole batch of rows?
   
   If you need random access to individual rows then parquet is not going to be a good fit.  We might want to investigate some kind of row-major format.
   
   If you only need to load specific batches of data then could you create a row group for each batch?  Or a separate file for each batch?
   
   If you need random access to batches of data (e.g. you don't know the row group boundaries at write time but it isn't random access to rows) then we could maybe use the row skip feature that was recently added to parquet (I don't think it has been exposed yet).




[GitHub] [arrow] westonpace commented on issue #35268: [C++] OrderBy with spillover

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1538959709

   Great!  You might take a look at https://github.com/apache/arrow/pull/35320 as you're getting started.  It helps give some overview of Acero in general.




Re: [I] [C++] OrderBy with spillover [arrow]

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1795191714

   Hi, @egillax, sorry for replying so late. I am stuck on the sorting implementation; the performance of my external merge sort is not good. I plan to continue this PR in late November.




[GitHub] [arrow] R-JunmingChen commented on issue #35268: [C++] OrderBy with spillover

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1555318525

   > The n-way merge can then call `FetchBatch` appropriately. An n-way merge is going to be challenging to implement performantly because it is not a columnar algorithm. Thinking about this more, the n-way merge kernel will probably not be a compute function. You will probably want to use something like `ExecBatchBuilder` to accumulate the results.
   
   Helpful suggestions. It is indeed challenging to implement external merge sort performantly.
   Draft plan:
   1. In `InsertBatch`, we do what your comment says.
   2. We do the n-way merge in the buffer. We could also compare only the columns needed for sorting and take the indices, in the format `{batch_index}_{index}`, to materialize a result.
   
   > Keep in mind that there is another approach which will not require an n-way merge (external distribution sort). This approach may be simpler to implement but I don't know.
   
   I have roughly investigated external distribution sort. Maybe we shouldn't choose it.
   External distribution sort needs n pivots that divide the entire dataset evenly. However, it's hard to get perfect pivots in the first n rounds of `InsertBatch` unless every batch is identically distributed. If the pivots can't divide the data evenly, the sort incurs **extra IO**, compared to external merge sort, to recursively divide the data until the smallest segment fits in the buffer.
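   Step 2 of the draft plan can be sketched in standalone C++: merge by comparing only the sort keys of each sorted batch, emit `(batch_index, row_index)` pairs, and materialize the payload in a separate "take" pass afterwards. Plain vectors stand in for batches; nothing here is Arrow API.

   ```cpp
   #include <cassert>
   #include <functional>
   #include <queue>
   #include <string>
   #include <tuple>
   #include <utility>
   #include <vector>

   // Merge sorted key batches, producing only row references in merged order.
   std::vector<std::pair<size_t, size_t>> MergeKeyIndices(
       const std::vector<std::vector<int>>& keys) {
     using Entry = std::tuple<int, size_t, size_t>;  // (key, batch, row)
     std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
     for (size_t b = 0; b < keys.size(); ++b) {
       if (!keys[b].empty()) heap.emplace(keys[b][0], b, 0);
     }
     std::vector<std::pair<size_t, size_t>> order;
     while (!heap.empty()) {
       auto [key, b, r] = heap.top();
       heap.pop();
       order.emplace_back(b, r);
       if (r + 1 < keys[b].size()) heap.emplace(keys[b][r + 1], b, r + 1);
     }
     return order;
   }

   // Materialize ("take") payload values in merged order.
   std::vector<std::string> Take(
       const std::vector<std::vector<std::string>>& payload,
       const std::vector<std::pair<size_t, size_t>>& order) {
     std::vector<std::string> out;
     for (auto [b, r] : order) out.push_back(payload[b][r]);
     return out;
   }
   ```

   The advantage of this split is that the comparison loop only touches the (narrow) key columns; the wide payload columns are gathered once at the end.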
   




[GitHub] [arrow] R-JunmingChen commented on issue #35268: [C++] OrderBy with spillover

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1620411919

   > Yes. But, luckily, in this case the functions should already exist: https://gist.github.com/westonpace/a45738e5a356324d410cba2c2713b1fd
   
   No, the compare Functions currently don't support comparing single elements between arrays, like `a_array->data()->GetValues<CType>(a_index) > b_array->data()->GetValues<CType>(b_index)`. These Functions can compare two arrays entirely, but merge sort only needs to compare the min/max keys between different arrays; comparing the entire arrays is too costly.
   
   




[GitHub] [arrow] R-JunmingChen commented on issue #35268: [C++] OrderBy with spillover

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1602034985

   > Yes, I think that would solve your problem. For example, is this similar to how the `file_parquet.cc` file uses `parquet::arrow::FileReader::GetRecordBatchGenerator`?
   
   Yes, it's similar. 
   
   I am close to finishing the first draft for this issue, and I have run into another problem that may need your guidance: how can I compare two values from two arrays at run time in an Acero node?
   To be specific, I can use code like the following to compare:
   `a_array->data()->GetValues<CType>(a_index) > b_array->data()->GetValues<CType>(b_index)`.
   For this code to work, the CType needs to be known at compile time. However, it looks like I can only get the CType at run time from the input of an Acero node.
   
   I know the current solution in `arrow::compute` materializes kernels for all the `arrow::DataType`s and registers them to a Function, so at run time a Function can handle data of any `arrow::DataType`.
   
   To solve my problem, do I need to create a Function and register corresponding kernels for comparison?
   
   Thanks
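   The compile-time/run-time tension described here is usually resolved with a dispatch switch: branch once on the run-time type id and forward to a template instantiation that knows the concrete CType. A minimal standalone sketch with a hypothetical mini type system (not Arrow's actual `Type::type` enum or kernel machinery):

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <stdexcept>

   enum class TypeId { kInt32, kInt64, kDouble };  // stand-in for a real type id

   // Compile-time typed comparison: returns -1, 0, or 1.
   template <typename CType>
   int CompareAs(const void* a, const void* b) {
     CType av = *static_cast<const CType*>(a);
     CType bv = *static_cast<const CType*>(b);
     return av < bv ? -1 : (bv < av ? 1 : 0);
   }

   // Run-time dispatch: the type id chooses the instantiation once;
   // after that the comparator can be stored as a function pointer and
   // called in a tight loop with no further branching on type.
   int CompareDynamic(TypeId id, const void* a, const void* b) {
     switch (id) {
       case TypeId::kInt32:  return CompareAs<int32_t>(a, b);
       case TypeId::kInt64:  return CompareAs<int64_t>(a, b);
       case TypeId::kDouble: return CompareAs<double>(a, b);
     }
     throw std::logic_error("unknown type id");
   }
   ```

   This is essentially what registering per-type kernels to a Function does, just written out by hand for one operation.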
   




[GitHub] [arrow] westonpace commented on issue #35268: [C++] OrderBy with spillover

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1604636046

   > To solve my problem, do I need to create a Function and add corresponding kernels to it for comparison?
   > Or, do we have any simpler method to solve the problem?
   
   Yes.  But, luckily, in this case the functions should already exist: https://gist.github.com/westonpace/a45738e5a356324d410cba2c2713b1fd




[GitHub] [arrow] westonpace commented on issue #35268: [C++] OrderBy with spillover

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1589678182

   Yes, I think that would solve your problem.  For example, is this similar to how the `file_parquet.cc` file uses `parquet::arrow::FileReader::GetRecordBatchGenerator`?




Re: [I] [C++] OrderBy with spillover [arrow]

Posted by "egillax (via GitHub)" <gi...@apache.org>.
egillax commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1761120205

   Hi, @R-JunmingChen
   
   Are you still working on this issue? 




[GitHub] [arrow] Liamtoha commented on issue #35268: [C++] OrderBy with spillover

Posted by "Liamtoha (via GitHub)" <gi...@apache.org>.
Liamtoha commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1528160830

   > [full quote of the original issue description omitted]
   
   Commit




[GitHub] [arrow] R-JunmingChen commented on issue #35268: [C++] OrderBy with spillover

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1534068312

   Hi, I want to take this issue. I am busy this month, but if no other contributor is working on this, I would like to start on it at the end of the month.




[GitHub] [arrow] R-JunmingChen commented on issue #35268: [C++] OrderBy with spillover

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1576946749

   Hi, @westonpace, I am stuck on reading back data. I need to read back data at a specific offset and batch size from the spillover files.
   Currently, my offset/batch size counts rows, and I use the parquet format for the temporary spill files.
   However, our parquet lib doesn't support reading a file with a row offset (only byte offsets are supported). Besides, the RecordBatchReader, which is a generator, supports a batch size in rows but doesn't support an offset.
   
   Maybe AsyncGenerator is a good way to use the RecordBatchReader, but I can't find examples of reading a parquet file with it; its test cases are all in-memory operations. If AsyncGenerator could resolve the problem, could you please show me a simple example?




[GitHub] [arrow] R-JunmingChen commented on issue #35268: [C++] OrderBy with spillover

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1543979250

   take




[GitHub] [arrow] R-JunmingChen commented on issue #35268: [C++] OrderBy with spillover

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1546690050

   > Great! You might take a look at #35320 as you're getting started. It helps give some overview of Acero in general.
   
   I am stuck on creating an appropriate design for external sort in Acero.
   From your doc, I think it would be good to first create an external sorting kernel and then use it in Acero.
   But Acero needs the separate parts of external sorting. To be precise:
   - In `InputReceived()` of the `order_by_sink` exec node with spillover, we should read data into a buffer of size `buffer_size`. When the buffer is full, we should sort it and write the sorted data to disk.
   - In `DoFinish()` of the `order_by_sink` exec node with spillover, we should perform an n-way external merge sort.
   
   So I have two simple plans:
   1. Implement an externalOrderByImpl that implements the above methods on its own. In this way, we don't implement anything as a compute kernel, though we may still need to call the sorting method from compute in `InputReceived()`.
   2. Implement two kernels: one does the spillover, and one does the n-way external merge sort; externalOrderByImpl just calls the two kernels in its `InputReceived()` and `DoFinish()`. This would be a little weird, since I think the two kernels don't conform to Arrow's design principles for kernels.
   
   I need some suggestions before I implement the code for external sorting.




[GitHub] [arrow] R-JunmingChen commented on issue #35268: [C++] OrderBy with spillover

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1585477388

   >If you only need to load specific batches of data then could you create a row group for each batch? Or a separate file for each batch?
   
   I can't know the read batch size when I write to disk, so I can't create a row group or file of a suitable size: the read batch size is decided by spill_over_count and buffer_size (roughly buffer_size / spill_over_count), and spill_over_count can't be determined until all the inputs are finished.
   
   > If you need random access to batches of data (e.g. you don't know the row group boundaries at write time but it isn't random access to rows) then we could maybe use the row skip feature that was recently added to parquet (I don't think it has been exposed yet).
   
   Sorry for my confusing description. The real problem is that I want to make `Future<std::optional<ExecBatch>> FetchNextBatch(int spill_index);` work. So, for a specific `example_spill_over_file_one.parquet`, I should first read with a `row_offset` of `batch_size * 0` and a batch size of `batch_size`, and when I use up that data for comparison I should then read with a `row_offset` of `batch_size * 1` and a batch size of `batch_size`, and so on.
   
   The skip feature could solve my problem more easily.
   Currently, I have used an AsyncGenerator, like **source_node.cc** does, to read back data. I think that's enough to solve my problem?
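   For a fixed-width column, the `FetchNextBatch`-style access pattern needs no row-skip support from the file format at all, because a row offset translates directly to a byte offset. A standalone sketch using a raw binary file of `int64_t` values rather than Parquet, purely to keep it self-contained:

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <fstream>
   #include <string>
   #include <vector>

   // Read `batch_size` rows starting at `row_offset` from a spill file of
   // contiguous int64_t values.  Returns fewer rows at end of file.
   std::vector<int64_t> ReadRows(const std::string& path, size_t row_offset,
                                 size_t batch_size) {
     std::ifstream in(path, std::ios::binary);
     // Fixed-width rows: byte offset = row offset * row width.
     in.seekg(static_cast<std::streamoff>(row_offset * sizeof(int64_t)));
     std::vector<int64_t> batch(batch_size);
     in.read(reinterpret_cast<char*>(batch.data()),
             static_cast<std::streamsize>(batch_size * sizeof(int64_t)));
     // Trim to what was actually read (short read at EOF).
     batch.resize(static_cast<size_t>(in.gcount()) / sizeof(int64_t));
     return batch;
   }
   ```

   With Parquet the same effect is usually achieved by sizing row groups to the batch size or by a row-skip feature, as discussed above; this sketch just shows why row-addressable spill files make `FetchNextBatch(spill_index)` straightforward.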
   
   
   




[GitHub] [arrow] westonpace commented on issue #35268: [C++] OrderBy with spillover

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1552967192

   `order_by_sink` is the old way.  We should be migrating to use `order_by` in `order_by_node.cc`.  You should make these changes there.
   
   > It will be a little weird, cause I think the two kernels don't conform arrow's design principle for kernel.
   
   I agree that compute kernels are not the best fit for some of this.
   
   > we should sort the buffer and write the sorted data to disk.
   
   This can be a kernel (this already exists with SortIndices).
   
   > Implement two kernels in arrow::compute, one does the spillover things
   
   Kernels should not write to disk.  I think we should create a separate abstraction, a spilling accumulation queue, that writes to disk.  For example, it could be something like...
   
   ```cpp
   class OrderedSpillingAccumulationQueue {
    public:
     OrderedSpillingAccumulationQueue(int64_t buffer_size);
   
     // Inserts a batch into the queue.  This may trigger a write to disk if enough data is accumulated.
     // If it does, then SpillCount should be incremented before this method returns (but the write can
     // happen in the background, asynchronously).
     Status InsertBatch(ExecBatch batch);
   
     // The number of files that have been written to disk.  This should also include any data in memory,
     // so it will be the number of files written to disk + 1 if there is in-memory data.
     int SpillCount();
   
     Future<ExecBatch> FetchBatch(int spill_index, int index_in_spill);
   };
   ```
   
   The n-way merge can then call `FetchBatch` appropriately.  An n-way merge is going to be challenging to implement performantly because it is not a columnar algorithm.  Thinking about this more, the n-way merge kernel will probably not be a compute function.  You will probably want to use something like `ExecBatchBuilder` to accumulate the results.
   
   Keep in mind that there is another approach which will not require an n-way merge (external distribution sort).  This approach may be simpler to implement.
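   The accumulate-sort-spill half of such a queue can be sketched in a few lines of standalone C++. Here an in-memory vector of runs stands in for files on disk, and plain `int`s stand in for `ExecBatch` rows; everything else (names, buffer policy) is an assumption for illustration, not Arrow's actual design.

   ```cpp
   #include <algorithm>
   #include <cassert>
   #include <vector>

   // Accumulate rows until buffer_size is reached, then sort the buffer and
   // "spill" it as a sorted run.  The runs are what an n-way merge would consume.
   class SpillQueue {
    public:
     explicit SpillQueue(size_t buffer_size) : buffer_size_(buffer_size) {}
   
     void InsertBatch(const std::vector<int>& batch) {
       for (int v : batch) {
         buffer_.push_back(v);
         if (buffer_.size() == buffer_size_) Spill();
       }
     }
   
     // Number of runs, counting any still-buffered data as one extra run.
     size_t SpillCount() const { return runs_.size() + (buffer_.empty() ? 0 : 1); }
   
     // Finish accumulation; each returned run is sorted and merge-ready.
     std::vector<std::vector<int>> FinishRuns() {
       if (!buffer_.empty()) Spill();
       return runs_;
     }
   
    private:
     void Spill() {
       std::sort(buffer_.begin(), buffer_.end());
       runs_.push_back(std::move(buffer_));  // a disk write in a real queue
       buffer_.clear();
     }
     size_t buffer_size_;
     std::vector<int> buffer_;
     std::vector<std::vector<int>> runs_;
   };
   ```

   The real queue would additionally write each run asynchronously and serve `FetchBatch(spill_index, index_in_spill)` by reading the spilled file back in batches.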

