Posted to github@arrow.apache.org by "metesynnada (via GitHub)" <gi...@apache.org> on 2023/05/02 15:14:06 UTC

[GitHub] [arrow-datafusion] metesynnada commented on pull request #6154: Simplify MemoryWriteExec

metesynnada commented on PR #6154:
URL: https://github.com/apache/arrow-datafusion/pull/6154#issuecomment-1531656243

   _TL;DR: we recommend retaining both versions of the lock mechanism, so we can cover both the case where the partition counts match and the case where they do not._
   
   So, we ran a benchmark to see how two different locking strategies perform when processing a stream of `RecordBatch` objects. Here are the two functions we compared:
   
   1. `lock_only_once`:
   
   ```rust
   use std::sync::Arc;

   use arrow::datatypes::SchemaRef;
   use arrow::record_batch::RecordBatch;
   use datafusion::physical_plan::stream::RecordBatchStreamAdapter;
   use datafusion::physical_plan::SendableRecordBatchStream;
   use futures::StreamExt;
   use tokio::sync::RwLock;

   /// Acquires the write lock once and holds it while draining the input stream.
   fn lock_only_once(
       schema: SchemaRef,
       state: (SendableRecordBatchStream, Arc<RwLock<Vec<RecordBatch>>>),
   ) -> SendableRecordBatchStream {
       let iter = futures::stream::unfold(state, |mut state| async move {
           // Take the write lock up front; it is held for the entire loop.
           let mut locked = state.1.write().await;
           loop {
               let batch = match state.0.next().await {
                   Some(Ok(batch)) => batch,
                   Some(Err(e)) => {
                       // Release the lock before yielding the error downstream.
                       drop(locked);
                       return Some((Err(e), state));
                   }
                   None => {
                       // Input exhausted; release the lock and end the stream.
                       drop(locked);
                       return None;
                   }
               };
               locked.push(batch);
           }
       });
       Box::pin(RecordBatchStreamAdapter::new(schema, iter))
   }
   ```
   
   2. `lock_multiple`:
   
   ```rust
   use std::sync::Arc;
   use arrow::datatypes::SchemaRef;
   use arrow::record_batch::RecordBatch;
   use datafusion::physical_plan::stream::RecordBatchStreamAdapter;
   use datafusion::physical_plan::SendableRecordBatchStream;
   use futures::StreamExt;
   use tokio::sync::RwLock;

   /// Re-acquires the write lock for every batch pushed into the buffer.
   fn lock_multiple(
       schema: SchemaRef,
       state: (SendableRecordBatchStream, Arc<RwLock<Vec<RecordBatch>>>),
   ) -> SendableRecordBatchStream {
       let iter = futures::stream::unfold(state, |mut state| async move {
           loop {
               let batch = match state.0.next().await {
                   Some(Ok(batch)) => batch,
                   Some(Err(e)) => return Some((Err(e), state)),
                   None => return None,
               };
               // One lock acquisition per batch.
               state.1.write().await.push(batch);
           }
       });
       Box::pin(RecordBatchStreamAdapter::new(schema, iter))
   }
   ```
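   The only behavioral difference between the two functions is the number of lock acquisitions: once per stream versus once per batch. A minimal synchronous sketch (std-only, using a hypothetical `CountingLock` wrapper rather than DataFusion types) makes that contrast concrete:

   ```rust
   use std::sync::atomic::{AtomicUsize, Ordering};
   use std::sync::{Mutex, MutexGuard};

   // Hypothetical wrapper that counts how often the lock is taken.
   struct CountingLock<T> {
       inner: Mutex<T>,
       acquisitions: AtomicUsize,
   }

   impl<T> CountingLock<T> {
       fn new(value: T) -> Self {
           Self {
               inner: Mutex::new(value),
               acquisitions: AtomicUsize::new(0),
           }
       }

       fn lock(&self) -> MutexGuard<'_, T> {
           self.acquisitions.fetch_add(1, Ordering::Relaxed);
           self.inner.lock().unwrap()
       }
   }

   // Analogue of `lock_only_once`: one acquisition for the whole input.
   fn push_lock_once(lock: &CountingLock<Vec<u64>>, items: &[u64]) {
       let mut guard = lock.lock();
       for &item in items {
           guard.push(item);
       }
   }

   // Analogue of `lock_multiple`: one acquisition per item.
   fn push_lock_per_item(lock: &CountingLock<Vec<u64>>, items: &[u64]) {
       for &item in items {
           lock.lock().push(item);
       }
   }

   fn main() {
       let items: Vec<u64> = (0..10_000).collect();

       let once = CountingLock::new(Vec::new());
       push_lock_once(&once, &items);
       // prints 1
       println!("single-lock acquisitions: {}", once.acquisitions.load(Ordering::Relaxed));

       let per_item = CountingLock::new(Vec::new());
       push_lock_per_item(&per_item, &items);
       // prints 10000
       println!("per-item acquisitions:    {}", per_item.acquisitions.load(Ordering::Relaxed));
   }
   ```

   At 10,000 batches, the per-item variant pays the acquisition overhead 10,000 times, which is what the benchmark below measures.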
   
   We tested these functions with an input size of 10,000, and here's what we found:
   
   **Multiple lock strategy (lock_multiple function):**
   
   - Mean execution time: ***545.97 µs***
   - Range: [544.78 µs, 547.26 µs]
   - A couple of outliers: 2 (2.00%) high mild
   
   **Single lock strategy (lock_only_once function):**
   
   - Mean execution time: ***330.21 µs***
   - Range: [329.53 µs, 330.94 µs]
   - A few outliers: 7 (7.00%) in total
       - 1 (1.00%) low mild
       - 3 (3.00%) high mild
       - 3 (3.00%) high severe
   
   So, for the given input size of 10,000, the single-lock strategy (`lock_only_once`) outperforms the multiple-lock strategy (`lock_multiple`), as we expected: it completes in about 39.5% less time.
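   The 39.5% figure follows directly from the two reported means; as a sanity check:

   ```rust
   fn main() {
       let multiple_us = 545.97; // mean of lock_multiple, in microseconds
       let single_us = 330.21;   // mean of lock_only_once, in microseconds
       // Relative reduction in execution time.
       let reduction = (multiple_us - single_us) / multiple_us * 100.0;
       println!("time reduction: {reduction:.1}%"); // prints 39.5%
   }
   ```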


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org