You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@arrow.apache.org by "Fokko (via GitHub)" <gi...@apache.org> on 2023/04/24 10:40:29 UTC

[GitHub] [arrow] Fokko opened a new issue, #35301: [Python] Allow passing in a mask when creating a scanner

Fokko opened a new issue, #35301:
URL: https://github.com/apache/arrow/issues/35301

   ### Describe the enhancement requested
   
   PyIceberg uses positional deletes to indicate which rows should be omitted. In the normal `table.filter(..)` a boolean mask is allowed, but not when constructing a scanner. Therefore we first have to read the data into a table, and then filter it. It would simplify the code (and also speed up) if this can be done directly when reading the data.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] wjones127 commented on issue #35301: [Python] Allow passing in a mask when creating a scanner

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.

wjones127 commented on issue #35301:
URL: https://github.com/apache/arrow/issues/35301#issuecomment-1540486938

   Have you thought about how you want to handle these masks in conjunction with filter pushdown? I wonder if the best approach would be to expose a new hidden column `__row_index` in the scanner. Then you can pass down a filter (which can prune row groups or even data pages), and then the data returned with have the original row indices in that `__row_index` column. From there you could filter the results for where the `__filename` column matches the filename and `__row_index` matches doesn't match any of the deleted rows in the mask. What do you think of that?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdblue commented on issue #35301: [Python] Allow passing in a mask when creating a scanner

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on issue #35301:
URL: https://github.com/apache/arrow/issues/35301#issuecomment-1567417047

   The `__row_index` idea is what we do in the Java Iceberg implementation. That works well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on issue #35301: [Python] Allow passing in a mask when creating a scanner

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35301:
URL: https://github.com/apache/arrow/issues/35301#issuecomment-1542536407

   I was thinking of using the exec batch's index.  However, I like the `__row_index` idea better since that wouldn't require any sequencing of batches so it should be more performant.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on issue #35301: [Python] Allow passing in a mask when creating a scanner

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on issue #35301:
URL: https://github.com/apache/arrow/issues/35301#issuecomment-1523022843

   Given you want a positional delete, would this rather be a "take" operation than a "filter". I know this is essentially the same (under the hood, filter an array also does a "take" of the required values), but conceptually for a Dataset this might be different. A filter can be defined with an expression, but a "take" is always with actual materialized values. And so we already have a `Dataset.take()` method that does that.
   
   So even if you start with a boolean filter, you should be able to already use `take()` by converting the boolean mask to indices with `pyarrow.compute.indices_nonzero`. 
   
   I am not fully sure how `Scanner::TakeRows` works given that positional indices depend on the order that data is scanned. I assume it follows the order of the actual vector of fragments.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on issue #35301: [Python] Allow passing in a mask when creating a scanner

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on issue #35301:
URL: https://github.com/apache/arrow/issues/35301#issuecomment-1535926749

   > I think it would be easier to create an order-dependent node that takes in a list of indices and skips those indices.
   
   So like the current FetchNode, but then using an array of indices instead of a slice as input argument? (and omitting vs selecting)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on issue #35301: [Python] Allow passing in a mask when creating a scanner

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35301:
URL: https://github.com/apache/arrow/issues/35301#issuecomment-1538996520

   > So like the current FetchNode, but then using an array of indices instead of a slice as input argument? (and omitting vs selecting)
   
   Yes, it would need a sequencing queue like the fetch node and the process function would be something like...
   
   ```
   row_start = current_offset;
   row_end = row_start + batch.length;
   deleted_indices = get_deleted_indices_in_range(row_start, row_end)
   if deleted_indices:
     normalized = deleted_indices - row_start
     kept_indices = range(0, batch.length) - normalized
     batch = batch.take(kept_indices)
   output.input_received(batch)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on issue #35301: [Python] Allow passing in a mask when creating a scanner

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35301:
URL: https://github.com/apache/arrow/issues/35301#issuecomment-1535504893

   The scanner assumes that inputs are extremely large (or potentially infinite) streams.  How would a mask be provided in that case?
   
   I think it would be easier to create an order-dependent node that takes in a list of indices and skips those indices.  Would that work?  I don't think it would be much effort to create.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Fokko commented on issue #35301: [Python] Allow passing in a mask when creating a scanner

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on issue #35301:
URL: https://github.com/apache/arrow/issues/35301#issuecomment-1536537859

   Based on @jorisvandenbossche his suggestion, I've updated the PR: https://github.com/apache/iceberg/pull/6775/files#diff-49144c27eab7e03926b0310aa8b513dbeb9c8d2a0d33bacf8dedbd88b4680aacR782-R787. 
   
   The `.take()` operation works perfectly to remove the deleted rows. The next step would be to first apply the deletes filter, and then a user filter (an additional filter, after the mask has been applied), and push all of this down to Arrow in a single filter operation. This would avoid materializing the table, and then filtering it even further down.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org