You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/05 15:50:37 UTC

[GitHub] [arrow] amol- opened a new pull request, #13075: ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression

amol- opened a new pull request, #13075:
URL: https://github.com/apache/arrow/pull/13075

   The function is focused on Tables, but as the extra work was minimal I added support for Dataset too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13075: ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on code in PR #13075:
URL: https://github.com/apache/arrow/pull/13075#discussion_r868086017


##########
python/pyarrow/tests/test_exec_plan.py:
##########
@@ -190,3 +191,35 @@ def test_table_join_keys_order():
         "colVals_l": ["a", "b", "f", None],
         "colVals_r": ["A", "B", None, "Z"],
     })
+
+
+@pytest.mark.parametrize("use_datasets", [False, True])
+def test_filter_table(use_datasets):
+    t = pa.table({
+        "a": [1, 2, 3, 4, 5],
+        "b": [10, 20, 30, 40, 50]
+    })
+    if use_datasets:
+        t = ds.dataset([t])
+
+    result = ep._filter_table(
+        t, (pc.field("a") <= pc.scalar(3)) & (pc.field("b") == pc.scalar(20)),
+        output_type=pa.Table if not use_datasets else ds.InMemoryDataset
+    )
+    if use_datasets:
+        result = result.to_table()
+    assert result == pa.table({
+        "a": [2],
+        "b": [20]
+    })
+
+    result = ep._filter_table(
+        t, pc.field("b") > pc.scalar(30),
+        output_type=pa.Table if not use_datasets else ds.InMemoryDataset
+    )
+    if use_datasets:
+        result = result.to_table()
+    assert result == pa.table({
+        "a": [4, 5],
+        "b": [40, 50]
+    })

Review Comment:
   Maybe add a small check that ensures a proper error is raised (and eg no segfault) if 1) passing an expressions that doesn't return a boolean value, 2) passing an expression referencing a non-existing field name



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #13075: ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #13075:
URL: https://github.com/apache/arrow/pull/13075#issuecomment-1118730177

   https://issues.apache.org/jira/browse/ARROW-16467


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #13075: ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on PR #13075:
URL: https://github.com/apache/arrow/pull/13075#issuecomment-1122058756

   What I observed from trying out this branch is that it does not preserve:
   
   ```
   In [18]: table1 = pa.table({'a': [1, 2, 3, 4], 'b': ['a'] * 4})
   
   In [19]: table2 = pa.table({'a': [1, 2, 3, 4], 'b': ['b'] * 4})
   
   In [20]: table = pa.concat_tables([table1, table2])
   
   In [21]: ep._filter_table(table, pc.field('a') == 1)
   Out[21]: 
   pyarrow.Table
   a: int64
   b: string
   ----
   a: [[1],[1]]
   b: [["b"],["a"]]
   
   In [22]: ep._filter_table(table, pc.field('a') == 1)
   Out[22]: 
   pyarrow.Table
   a: int64
   b: string
   ----
   a: [[1],[1]]
   b: [["a"],["b"]]
   ```
   
   But the current cython wrappers of the ExecPlan are not using a "to_table" method, it is using `Table::FromRecordBatchReader`, with a record batch generator created using `MakeGeneratorReader`, which explicitly says it does not preserve order:
   
   https://github.com/apache/arrow/blob/5a4b3db396f88898917620de50863b6d3a477a7a/cpp/src/arrow/compute/exec/exec_plan.h#L440-L447
   
   I see that we currently use a "sink" node as final output node, but there is also a "table_sink" node. That might simplify the code a bit to get a Table (without manually creating a record batch reader and creating a table from that), but  don't think that would help with the ordering? (at least in the code I also don't see any ordering handling for this sink)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on pull request #13075: ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression

Posted by GitBox <gi...@apache.org>.

amol- commented on PR #13075:
URL: https://github.com/apache/arrow/pull/13075#issuecomment-1122352950

   Given that preserving order in the proper way requires a non trivial amount of work from what I can see (not much of the required infrastructure is available outside of Datasets) I pushed a work-around in the current branch and I made https://issues.apache.org/jira/browse/ARROW-16518 to come up with a solution that works with threads too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on pull request #13075: ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression

Posted by GitBox <gi...@apache.org>.

westonpace commented on PR #13075:
URL: https://github.com/apache/arrow/pull/13075#issuecomment-1121557506

   Yes, `to_table` supports this and it looks like that is what is being used here so it should work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #13075: ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression

Posted by GitBox <gi...@apache.org>.

ursabot commented on PR #13075:
URL: https://github.com/apache/arrow/pull/13075#issuecomment-1125676315

   Benchmark runs are scheduled for baseline = a8479e9c252482438b6fc2bc0383ac5cf6a09d59 and contender = 0bae875bb8d992aab762e7fd81eef24bda7963ad. 0bae875bb8d992aab762e7fd81eef24bda7963ad is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/12b9b01f6829478ea1a8b604dc44e24d...cb0c56eae5754aeeb7971646d7f09210/)
   [Failed :arrow_down:0.74% :arrow_up:0.04%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/680e316e8d324fefa371f9564808b551...eea27e77e2164154b954c172cf773500/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/2945509275b04e5ebaa61ef12a77fd49...8d60cd7893f74a06b375d7a30c85a33e/)
   [Finished :arrow_down:0.28% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/896bd0fb06654959a22633cf129ed076...896aec7197dd497297c2427d51e9a734/)
   Buildkite builds:
   [Finished] [`0bae875b` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/739)
   [Failed] [`0bae875b` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/736)
   [Finished] [`0bae875b` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/726)
   [Finished] [`0bae875b` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/741)
   [Finished] [`a8479e9c` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/738)
   [Finished] [`a8479e9c` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/735)
   [Finished] [`a8479e9c` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/725)
   [Finished] [`a8479e9c` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/740)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on a diff in pull request #13075: ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression

Posted by GitBox <gi...@apache.org>.

amol- commented on code in PR #13075:
URL: https://github.com/apache/arrow/pull/13075#discussion_r870030999


##########
python/pyarrow/tests/test_exec_plan.py:
##########
@@ -190,3 +191,35 @@ def test_table_join_keys_order():
         "colVals_l": ["a", "b", "f", None],
         "colVals_r": ["A", "B", None, "Z"],
     })
+
+
+@pytest.mark.parametrize("use_datasets", [False, True])
+def test_filter_table(use_datasets):
+    t = pa.table({
+        "a": [1, 2, 3, 4, 5],
+        "b": [10, 20, 30, 40, 50]
+    })
+    if use_datasets:
+        t = ds.dataset([t])
+
+    result = ep._filter_table(
+        t, (pc.field("a") <= pc.scalar(3)) & (pc.field("b") == pc.scalar(20)),
+        output_type=pa.Table if not use_datasets else ds.InMemoryDataset
+    )
+    if use_datasets:
+        result = result.to_table()
+    assert result == pa.table({
+        "a": [2],
+        "b": [20]
+    })
+
+    result = ep._filter_table(
+        t, pc.field("b") > pc.scalar(30),
+        output_type=pa.Table if not use_datasets else ds.InMemoryDataset
+    )
+    if use_datasets:
+        result = result.to_table()
+    assert result == pa.table({
+        "a": [4, 5],
+        "b": [40, 50]
+    })

Review Comment:
   👍 done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #13075: ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on PR #13075:
URL: https://github.com/apache/arrow/pull/13075#issuecomment-1123647353

   Yes, that will indeed require some non-trivial work in the engine. Leaving it as is for now is good for me!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amol- commented on pull request #13075: ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression

Posted by GitBox <gi...@apache.org>.

amol- commented on PR #13075:
URL: https://github.com/apache/arrow/pull/13075#issuecomment-1122292526

   > I see that we currently use a "sink" node as final output node, but there is also a "table_sink" node. That might simplify the code a bit to get a Table (without manually creating a record batch reader and creating a table from that), but don't think that would help with the ordering? (at least in the code I also don't see any ordering handling for this sink)
   
   I tried and can confirm that using `table_sink` node doesn't help with preserving the order.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche closed pull request #13075: ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche closed pull request #13075: ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression
URL: https://github.com/apache/arrow/pull/13075


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #13075: ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on PR #13075:
URL: https://github.com/apache/arrow/pull/13075#issuecomment-1121196019

   > Does the exec plan already support tagging batches and reordering them when gathering into a table? I seem to recall that we do that for the Dataset to_table (but I don't know if this logic still lives in the scanner or has moved to exec plan)
   
   cc @westonpace 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #13075: ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on PR #13075:
URL: https://github.com/apache/arrow/pull/13075#issuecomment-1122074114

   For the dataset scanner, I think the reordering logic lives in `AsyncScanner::ToTable()` / `AsyncTableAssemblyState`, which uses the fragment index to put the EnumeratedRecordBatch objects in the correct order in the resulting vector of batches.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org