Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/05 13:07:10 UTC

[GitHub] [arrow-rs] alamb opened a new issue, #1523: `take` kernel that works across multiple `RecordBatch`es

alamb opened a new issue, #1523:
URL: https://github.com/apache/arrow-rs/issues/1523

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   For several operations in data processing, it is important to be able to select some subset of rows (for example, when sorting or filtering).
   
   For example, the current [take](https://docs.rs/arrow/11.1.0/arrow/compute/kernels/take/index.html) kernel works like this:
   
   ```text
   ┌─────────────────┐      ┌─────────┐                              ┌─────────────────┐
   │        A        │      │    0    │                              │        A        │
   ├─────────────────┤      ├─────────┤                              ├─────────────────┤
   │        D        │      │    2    │                              │        B        │
   ├─────────────────┤      ├─────────┤   take(values, indices)      ├─────────────────┤
   │        B        │      │    3    │ ─────────────────────────▶   │        C        │
   ├─────────────────┤      ├─────────┤                              ├─────────────────┤
   │        C        │      │    1    │                              │        D        │
   ├─────────────────┤      └─────────┘                              └─────────────────┘
   │        E        │                                                                  
   └─────────────────┘                                                                  
      values array            indices array                               result        
                                                                                        
   ```
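   
   As a rough illustration of the semantics in the diagram, here is a minimal std-only sketch. `take_model` is a hypothetical name and operates on plain slices; it is not the arrow crate's `take` signature, only a model of what the kernel computes.
   
   ```rust
   /// Std-only model of a `take`-style kernel (NOT the arrow crate's API).
   /// Builds a new vector by reading `values[i]` for each `i` in `indices`.
   /// Returns `None` if any index is out of bounds.
   fn take_model<T: Clone>(values: &[T], indices: &[usize]) -> Option<Vec<T>> {
       indices.iter().map(|&i| values.get(i).cloned()).collect()
   }
   ```
   
   With values `[A, D, B, C, E]` and indices `[0, 2, 3, 1]`, this model yields `[A, B, C, D]`, matching the picture above.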
   
   In DataFusion, our operators receive multiple record batches at a time, and we would like to perform operations such as sorting on them without first combining them into a single record batch. For example:
   ```
   ┌─────────────────┐                                                        
   │        A        │                                                        
   ├─────────────────┤                                                        
   │        D        │                                     ┌─────────────────┐
   └─────────────────┘                                     │        A        │
     values array 0                                        ├─────────────────┤
                                                           │        B        │
                                        ?                  ├─────────────────┤
                                                           │        C        │
   ┌─────────────────┐     ─────────────────────────▶      ├─────────────────┤
   │        B        │                                     │        D        │
   ├─────────────────┤                                     └─────────────────┘
   │        C        │                                                        
   ├─────────────────┤                                                        
   │        E        │                                      desired result    
   └─────────────────┘                                                        
     values array 1                                                           
   ```
   
   
   
   
   **Describe the solution you'd like**
   
   I would like a function, something like `batch_take`, that takes a vector of `RecordBatch`es and a list of `(record_batch_index, offset_in_the_record_batch)` tuples and produces the resulting array, like this:
   
   
   ```
   ┌─────────────────┐      ┌─────────┐                                  ┌─────────────────┐
   │        A        │      │ (0, 0)  │        batch_take(               │        A        │
   ├─────────────────┤      ├─────────┤          [values0, values1],     ├─────────────────┤
   │        D        │      │ (1, 0)  │          batch_indices           │        B        │
   └─────────────────┘      ├─────────┤        )                         ├─────────────────┤
     values array 0         │ (1, 1)  │      ─────────────────────────▶  │        C        │
                            ├─────────┤                                  ├─────────────────┤
                            │ (0, 1)  │                                  │        D        │
                            └─────────┘                                  └─────────────────┘
   ┌─────────────────┐                                                                      
   │        B        │                                                                      
   ├─────────────────┤   batch_indices                                        result        
   │        C        │        array                                                         
   ├─────────────────┤                                                                      
   │        E        │                                                                      
   └─────────────────┘                                                                      
     values array 1                                                                                                                                                                                                                    
   ```
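   
   To make the proposal concrete, here is a minimal std-only sketch for a single column. It assumes each batch is modelled as a `Vec<T>` rather than a real `RecordBatch`, and the signature is purely illustrative, not an existing arrow API.
   
   ```rust
   /// Hypothetical, std-only sketch of `batch_take` for one column.
   /// `batches[b][o]` stands in for row `o` of record batch `b`.
   /// Returns `None` if any (batch, offset) pair is out of bounds.
   fn batch_take<T: Clone>(
       batches: &[Vec<T>],
       indices: &[(usize, usize)],
   ) -> Option<Vec<T>> {
       indices
           .iter()
           .map(|&(batch, offset)| batches.get(batch)?.get(offset).cloned())
           .collect()
   }
   ```
   
   With values array 0 = `[A, D]` and values array 1 = `[B, C, E]`, the pairs `(0, 0), (1, 0), (1, 1), (0, 1)` select `A, B, C, D`. A real kernel would presumably work on Arrow arrays and handle nulls and every column of the batch, which this sketch does not attempt.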
   
   This will come up in Grouping and Join operators as well. 
   
   **Describe alternatives you've considered**
   There are two more features that @yjshen added in https://github.com/apache/arrow-datafusion/pull/2132 that we might consider:
   1. Take a list of `(record_batch_index, offset_in_the_record_batch, num_records)` tuples to optimize the common case of copying multiple consecutive rows from each source batch.
   2. Provide an iterator interface so that the results can be formed a batch at a time, rather than in one large array.
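   
   Both ideas can be sketched on plain vectors. The names below (`Run`, `batch_take_runs`, `batch_take_chunks`, `chunk_rows`) are hypothetical and are not taken from the linked PR; batches are again modelled as `Vec<T>`.
   
   ```rust
   /// Hypothetical run descriptor: copy `len` consecutive rows starting at
   /// row `offset` of batch `batch`.
   #[derive(Clone, Copy, Debug)]
   struct Run {
       batch: usize,
       offset: usize,
       len: usize,
   }
   
   /// Idea 1: copy each run with one slice copy instead of one lookup per row.
   /// Returns `None` if a run falls outside its batch.
   fn batch_take_runs<T: Clone>(batches: &[Vec<T>], runs: &[Run]) -> Option<Vec<T>> {
       let mut out = Vec::new();
       for r in runs {
           let end = r.offset.checked_add(r.len)?;
           out.extend_from_slice(batches.get(r.batch)?.get(r.offset..end)?);
       }
       Some(out)
   }
   
   /// Idea 2: lazily yield the selected rows in chunks of at most `chunk_rows`
   /// rows, so the full result is never materialized in one vector.
   /// A `chunk_rows` of 0 is treated as 1.
   fn batch_take_chunks<'a, T: Clone>(
       batches: &'a [Vec<T>],
       indices: &'a [(usize, usize)],
       chunk_rows: usize,
   ) -> impl Iterator<Item = Option<Vec<T>>> + 'a {
       indices.chunks(chunk_rows.max(1)).map(move |chunk| {
           chunk
               .iter()
               .map(|&(b, o)| batches.get(b)?.get(o).cloned())
               .collect::<Option<Vec<T>>>()
       })
   }
   ```
   
   In this sketch an out-of-bounds entry surfaces as `None` for the affected result (or chunk) rather than a panic; a real implementation would more likely return an Arrow error type.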
   
   
   **Additional context**
   This came up while @yjshen was implementing a more memory-efficient sort in DataFusion (https://github.com/apache/arrow-datafusion/pull/2132), and it was suggested by @Dandandan in https://github.com/apache/arrow-datafusion/pull/2132#discussion_r842156111
   
   We can probably move a bunch of the implementation from that PR to this one. 
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on issue #1523: `take` kernel that works across multiple `RecordBatch`es

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1523:
URL: https://github.com/apache/arrow-rs/issues/1523#issuecomment-1128815568

   > Hi @Ted-Jiang -- thanks! I don't have any implementations at the moment. It may be interesting to look at the other PRs linked to this ticket.
   
   Sure, I have found some interesting info.




[GitHub] [arrow-rs] alamb commented on issue #1523: `take` kernel that works across multiple `RecordBatch`es

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1523:
URL: https://github.com/apache/arrow-rs/issues/1523#issuecomment-1128720792

   Hi @Ted-Jiang -- thanks! I don't have any implementations at the moment. It may be interesting to look at the other PRs linked to this ticket.




[GitHub] [arrow-rs] Ted-Jiang commented on issue #1523: `take` kernel that works across multiple `RecordBatch`es

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1523:
URL: https://github.com/apache/arrow-rs/issues/1523#issuecomment-1128495236

   Great explanation! I am interested in this. May I give it a try? 😁
   If this is already in your plans, I would be glad to see your implementation.




[GitHub] [arrow-rs] tustvold closed issue #1523: `take` kernel that works across multiple `RecordBatch`es

Posted by GitBox <gi...@apache.org>.
tustvold closed issue #1523: `take` kernel that works across multiple `RecordBatch`es
URL: https://github.com/apache/arrow-rs/issues/1523

