Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/29 12:36:25 UTC

[GitHub] [arrow] Dandandan commented on pull request #9036: ARROW-11053: [Rust] [DataFusion] Optimize joins with dynamic capacity for output batches

Dandandan commented on pull request #9036:
URL: https://github.com/apache/arrow/pull/9036#issuecomment-752060487


   A significant source of slowness seems to be the creation of the `MutableArrayData` structure, both how often it is used and how inefficient it is to build. In profiling I see a lot of time spent in `build_extend`, `freeze`, etc.
   
   Changing this piece of code to build a `Vec<&ArrayData>` directly gives a ~10% speedup locally with batches of size 1000 on your branch, @andygrove:
   ```rust
           let (is_primary, arrays) = match primary[0].schema().index_of(field.name()) {
               Ok(i) => Ok((true, primary.iter().map(|batch| batch.column(i).data_ref().as_ref()).collect::<Vec<_>>())),
               Err(_) => {
                   match secondary[0].schema().index_of(field.name()) {
                       Ok(i) => Ok((false, secondary.iter().map(|batch| batch.column(i).data_ref().as_ref()).collect::<Vec<_>>())),
                       _ => Err(DataFusionError::Internal(
                           format!("During execution, the column {} was not found in either the left or right side of the join", field.name())
                       ))
                   }
               }
           }.map_err(DataFusionError::into_arrow_external_error)?;
   ```
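
For readers without the DataFusion sources at hand, here is a rough, standard-library-only sketch of the same lookup pattern: find the column on the primary side, fall back to the secondary side, and collect borrowed references to that column from every batch without building an intermediate structure. `Batch`, `side_refs` and `column_refs` are hypothetical stand-ins invented for this illustration, not Arrow or DataFusion APIs, and plain `String` errors stand in for `DataFusionError`.

```rust
/// Hypothetical stand-in for a record batch: column names plus column data.
struct Batch {
    names: Vec<String>,
    columns: Vec<Vec<i32>>,
}

impl Batch {
    /// Position of the column called `name`, if this batch has one.
    fn index_of(&self, name: &str) -> Option<usize> {
        self.names.iter().position(|n| n == name)
    }
}

/// Borrowed references to the column `name` in every batch of one side.
/// The first batch's names are assumed to describe the whole side.
/// Unlike indexing with `[0]`, an empty side yields `None` instead of panicking.
fn side_refs<'a>(side: &'a [Batch], name: &str) -> Option<Vec<&'a [i32]>> {
    let i = side.first()?.index_of(name)?;
    Some(side.iter().map(|batch| batch.columns[i].as_slice()).collect())
}

/// Returns `(is_primary, columns)`, checking `primary` before `secondary`.
fn column_refs<'a>(
    primary: &'a [Batch],
    secondary: &'a [Batch],
    name: &str,
) -> Result<(bool, Vec<&'a [i32]>), String> {
    if let Some(cols) = side_refs(primary, name) {
        return Ok((true, cols));
    }
    match side_refs(secondary, name) {
        Some(cols) => Ok((false, cols)),
        None => Err(format!(
            "column {} was not found in either side of the join",
            name
        )),
    }
}
```

Calling `column_refs(&primary, &secondary, "a")` returns the flag plus one slice per batch. Nothing is copied, which is the property the change above relies on; whether it yields a similar speedup elsewhere would need to be measured.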

