You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/04 14:26:21 UTC

[GitHub] [arrow] paleolimbot opened a new pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

paleolimbot opened a new pull request #12564:
URL: https://github.com/apache/arrow/pull/12564


   This is an early draft PR to add a basic Substrait consumer, at least enough for us to test our Substrait output to help us stay on track.
   
   The main issue I have is that I can't get this to build locally, even though I'm building with `-DARROW_ENGINE=ON`. I don't think that `libsubstrait.a` is getting installed nor are the headers in `arrow/engine/substrait`, and I'm not sure if this is a config error or whether I just don't know cmake well enough to make it work on my machine.
   
   The second thing that I don't know how to do is how to get the source node at the top of the serialized plan.
   
   The third thing I don't know how to do is to get the plan to return a RecordBatchReader like we do with the other exec plan runner.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#discussion_r836969452



##########
File path: r/tests/testthat/test-query-engine.R
##########
@@ -0,0 +1,86 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+test_that("do_exec_plan_substrait can evaluate a simple plan", {
+  skip_if_not_available("engine")
+
+  df <- data.frame(i = 1:5, b = rep_len(c(TRUE, FALSE), 5))
+  table <- arrow_table(df, schema = schema(i = int64(), b = bool()))
+
+  tf <- tempfile()
+  on.exit(unlink(tf))
+  write_parquet(table, tf)
+
+  substrait_json <- sprintf('{
+    "relations": [
+      {"rel": {
+        "read": {
+          "base_schema": {
+            "struct": {
+              "types": [ {"i64": {}}, {"bool": {}} ]
+            },
+            "names": ["i", "b"]
+          },
+          "local_files": {
+            "items": [
+              {
+                "uri_file": "file://%s",
+                "format": "FILE_FORMAT_PARQUET"
+              }
+            ]
+          }
+        }
+      }}
+    ],
+    "extension_uris": [
+      {
+        "extension_uri_anchor": 7,
+        "uri": "https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml"

Review comment:
       Hmm...or maybe we even need `substrait/cpp/v1.0.0` as the different implementations may have different "hints" and "extensions" that they support :grimacing: 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1081875949


   Benchmark runs are scheduled for baseline = 6f9b07a41b50a6d2db37a28912ad22148453fc54 and contender = 4d0436a04cd8e1b01b7f4ae6169973760088c702. 4d0436a04cd8e1b01b7f4ae6169973760088c702 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/965bb209cace477cbaf6601b7105107a...878fbf3c04894759a0ad3a75c1cc9138/)
   [Finished :arrow_down:0.13% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/8b4a561ed7314efa84bbcfc30d2f1146...f6ea13d3564246138d8a4f8a8aab35ca/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fbac1b8f329b4f5abadea48853f31f33...ec35388843d6486d903f9e34132c7603/)
   [Finished :arrow_down:0.17% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b6f49791a6174d5bb85050f728b9c494...59f9ad6d3c9c465fbec90895d64a51a3/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/407| `4d0436a0` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/393| `4d0436a0` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/403| `4d0436a0` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/406| `6f9b07a4` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/392| `6f9b07a4` test-mac-arm>
   [Scheduled] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/392| `6f9b07a4` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/402| `6f9b07a4` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1059487075


   Whee...this works! (Although until ARROW-15849 is merged it's necessary to copy the arrow/engine/substrait/ headers manually...). It currently just prints the output...before this is merged I think it would be nice to return a `Table` to help us test output.
   
   ``` r
   # remotes::install_github("apache/arrow/r#12564")
   # remotes::install_github("voltrondata/substrait-r")
   library(arrow, warn.conflicts = FALSE)
   library(substrait)
   
   # currently works via the ReadRel object (not an Arrow source node),
   # so we need a parquet file to work with
   temp_parquet_file <- tempfile()
   write_parquet(mtcars, temp_parquet_file)
   
   plan <- substrait$Plan$create(
     relations = list(
       substrait$PlanRel$create(
         rel = substrait$Rel$create(
           read = substrait$ReadRel$create(
             base_schema = as_substrait(mtcars, "substrait.NamedStruct"),
             local_files = substrait$ReadRel$LocalFiles$create(
               items = list(
                 substrait$ReadRel$LocalFiles$FileOrFiles$create(
                   uri_file = sprintf("file://%s", temp_parquet_file),
                   format = substrait$ReadRel$LocalFiles$FileOrFiles$FileFormat$FILE_FORMAT_PARQUET
                 )
               )
             )
           )
         )
       )
     )
   )
   
   arrow:::do_exec_plan_substrait(as.raw(plan))
   #> <tag: 0 rows: 32>
   #> ExecBatch
   #>     # Rows: 32
   #>     0: Array[21,21,22.8,21.4,18.7,18.1,14.3,24.4,22.8,19.2,...,15.2,13.3,19.2,27.3,26,30.4,15.8,19.7,15,21.4]
   #>     1: Array[6,6,4,6,8,6,8,4,4,6,...,8,8,8,4,4,4,8,6,8,4]
   #>     2: Array[160,160,108,258,360,225,360,146.7,140.8,167.6,...,304,350,400,79,120.3,95.1,351,145,301,121]
   #>     3: Array[110,110,93,110,175,105,245,62,95,123,...,150,245,175,66,91,113,264,175,335,109]
   #>     4: Array[3.9,3.9,3.85,3.08,3.15,2.76,3.21,3.69,3.92,3.92,...,3.15,3.73,3.08,4.08,4.43,3.77,4.22,3.62,3.54,4.11]
   #>     5: Array[2.62,2.875,2.32,3.215,3.44,3.46,3.57,3.19,3.15,3.44,...,3.435,3.84,3.845,1.935,2.14,1.513,3.17,2.77,3.57,2.78]
   #>     6: Array[16.46,17.02,18.61,19.44,17.02,20.22,15.84,20,22.9,18.3,...,17.3,15.41,17.05,18.9,16.7,16.9,14.5,15.5,14.6,18.6]
   #>     7: Array[0,0,1,1,0,1,0,1,1,1,...,0,0,0,1,0,1,0,0,0,1]
   #>     8: Array[1,1,1,0,0,0,0,0,0,0,...,0,0,0,1,1,1,1,1,1,1]
   #>     9: Array[4,4,4,3,3,3,3,4,4,4,...,3,3,3,4,5,5,5,5,5,4]
   #>     10: Array[4,4,1,1,2,1,4,2,2,4,...,2,4,2,1,2,2,4,6,8,2]
   #>     11: Scalar[0]
   #>     12: Scalar[0]
   #>     13: Scalar[true]
   #> 
   #> <tag: 0 finished>
   ```
   
   <sup>Created on 2022-03-04 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#discussion_r819653877



##########
File path: r/R/query-engine.R
##########
@@ -296,3 +296,39 @@ ExecNode <- R6Class("ExecNode",
     schema = function() ExecNode_output_schema(self)
   )
 )
+
+do_exec_plan_substrait <- function(.data, substrait_plan) {
+  if (is.string(substrait_plan)) {
+    substrait_plan <- engine__internal__SubstraitFromJSON(substrait_plan)
+  } else if (is.raw(substrait_plan)) {
+    substrait_plan <- buffer(substrait_plan)
+  } else {
+    abort("`substrait_plan` must be a JSON string or raw() vector")
+  }
+
+  plan <- ExecPlan$create()
+
+  if (inherits(.data, "RecordBatchReader")) {
+    source_node <- ExecNode_ReadFromRecordBatchReader(self, dataset$.data)
+  } else if (inehrits(.data, "ArrowTabular")) {
+    dataset <- InMemoryDataset$create(dataset)
+    source_node <- ExecNode_Scan(
+      plan,
+      dataset,
+      Expression$scalar(TRUE),
+      colnames %||% character(0)
+    )
+  } else if (inherits(.data, "Dataset")) {
+    source_node <- ExecNode_Scan(
+      plan,
+      .data,
+      Expression$scalar(TRUE),
+      colnames %||% character(0)
+    )
+  } else {
+    obj_desc <- paste0(class(.data), collapse = " / ")
+    abort(glue("Can't construct source node from object of type {obj_desc}"))
+  }

Review comment:
       plan$Build takes a query but plan$Scan doesn't require it: https://github.com/apache/arrow/blob/master/r/R/query-engine.R#L81




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#discussion_r819808503



##########
File path: r/src/compute-exec.cpp
##########
@@ -277,4 +278,65 @@ std::shared_ptr<compute::ExecNode> ExecNode_ReadFromRecordBatchReader(
   return MakeExecNodeOrStop("source", plan.get(), {}, options);
 }
 
+// just until we figure out the incantation to get a proper sinknode

Review comment:
       That would be great! Perhaps for this PR I will keep to the example usage (which reads a parquet file and prints the output) and use the C++ method when it becomes available (it will take us a while to get the dplyr->Substrait generation right so there's no particular rush).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot removed a comment on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
paleolimbot removed a comment on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1059487075


   Whee...this works! (Although until ARROW-15849 is merged it's necessary to copy the arrow/engine/substrait/ headers manually...). It currently just prints the output...before this is merged I think it would be nice to return a `Table` to help us test output.
   
   ``` r
   # remotes::install_github("apache/arrow/r#12564")
   # remotes::install_github("voltrondata/substrait-r")
   library(arrow, warn.conflicts = FALSE)
   library(substrait)
   
   # currently works via the ReadRel object (not an Arrow source node),
   # so we need a parquet file to work with
   temp_parquet_file <- tempfile()
   write_parquet(mtcars, temp_parquet_file)
   
   plan <- substrait$Plan$create(
     relations = list(
       substrait$PlanRel$create(
         rel = substrait$Rel$create(
           read = substrait$ReadRel$create(
             base_schema = as_substrait(mtcars, "substrait.NamedStruct"),
             local_files = substrait$ReadRel$LocalFiles$create(
               items = list(
                 substrait$ReadRel$LocalFiles$FileOrFiles$create(
                   uri_file = sprintf("file://%s", temp_parquet_file),
                   format = substrait$ReadRel$LocalFiles$FileOrFiles$FileFormat$FILE_FORMAT_PARQUET
                 )
               )
             )
           )
         )
       )
     )
   )
   
   arrow:::do_exec_plan_substrait(as.raw(plan))
   #> <tag: 0 rows: 32>
   #> ExecBatch
   #>     # Rows: 32
   #>     0: Array[21,21,22.8,21.4,18.7,18.1,14.3,24.4,22.8,19.2,...,15.2,13.3,19.2,27.3,26,30.4,15.8,19.7,15,21.4]
   #>     1: Array[6,6,4,6,8,6,8,4,4,6,...,8,8,8,4,4,4,8,6,8,4]
   #>     2: Array[160,160,108,258,360,225,360,146.7,140.8,167.6,...,304,350,400,79,120.3,95.1,351,145,301,121]
   #>     3: Array[110,110,93,110,175,105,245,62,95,123,...,150,245,175,66,91,113,264,175,335,109]
   #>     4: Array[3.9,3.9,3.85,3.08,3.15,2.76,3.21,3.69,3.92,3.92,...,3.15,3.73,3.08,4.08,4.43,3.77,4.22,3.62,3.54,4.11]
   #>     5: Array[2.62,2.875,2.32,3.215,3.44,3.46,3.57,3.19,3.15,3.44,...,3.435,3.84,3.845,1.935,2.14,1.513,3.17,2.77,3.57,2.78]
   #>     6: Array[16.46,17.02,18.61,19.44,17.02,20.22,15.84,20,22.9,18.3,...,17.3,15.41,17.05,18.9,16.7,16.9,14.5,15.5,14.6,18.6]
   #>     7: Array[0,0,1,1,0,1,0,1,1,1,...,0,0,0,1,0,1,0,0,0,1]
   #>     8: Array[1,1,1,0,0,0,0,0,0,0,...,0,0,0,1,1,1,1,1,1,1]
   #>     9: Array[4,4,4,3,3,3,3,4,4,4,...,3,3,3,4,5,5,5,5,5,4]
   #>     10: Array[4,4,1,1,2,1,4,2,2,4,...,2,4,2,1,2,2,4,6,8,2]
   #>     11: Scalar[0]
   #>     12: Scalar[0]
   #>     13: Scalar[true]
   #> 
   #> <tag: 0 finished>
   ```
   
   <sup>Created on 2022-03-04 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#discussion_r836635079



##########
File path: r/configure
##########
@@ -244,6 +244,14 @@ if [ $? -eq 0 ]; then
     # NOTE: arrow-dataset is assumed to have the same -L flag as arrow
     # so there is no need to add its location to PKG_DIRS
   fi
+  # Check for Arrow Engine subcomponent
+  grep -i 'set(ARROW_ENGINE "ON")' $ARROW_OPTS_CMAKE >/dev/null 2>&1
+  if [ $? -eq 0 ]; then
+    PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_ENGINE"
+    PKG_LIBS="-larrow_engine $PKG_LIBS"
+    # NOTE: arrow-engine is assumed to have the same -L flag as arrow
+    # so there is no need to add its location to PKG_DIRS
+  fi

Review comment:
       Done! I really need to get a Windows environment up and running so that I notice these kinds of things.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1061817573


   Ok! This is ready for some eyes. The C++ that works with the ExecPlan is something that I got to work, but it is not elegant (notably, it requires that the output field names are known in advance). As @westonpace mentioned above, the C++ may be able to absorb the implementation for substrait -> RecordBatchReader, which is really what we want.
   
   I'm happy to put this PR on hold until there's more infrastructure to make this easier, although either way we will likely need several follow-up PRs. My preference would be to merge a preliminary version with an unexported `do_exec_plan_substrait()` so that Nic and I don't have to rebuild the R package every time we switch projects (but we can also make that work if that sounds too hacky). I think that in all cases we will need multiple follow up PRs as the capabilities of C++ substrait and the R substrait evolve.
   
   To make sure this PR is going to serve its purpose, I also implemented a preliminary substrait package function that calls `do_exec_plan_substrait()` (for reference below).
   
   <details>
   
   substrait_eval_arrow <- function(plan, tables) {
     plan <- as_substrait(plan, "substrait.Plan")
     stopifnot(rlang::is_named2(tables))
   
     # only support plans with exactly one relation in the relations list for now
     stopifnot(length(plan$relations) == 1)
   
     temp_parquet <- vapply(tables, function(i) tempfile(), character(1))
     on.exit(unlink(temp_parquet))
   
     local_file_tables <- lapply(seq_along(tables), function(i) {
       substrait$ReadRel$LocalFiles$create(
         items = list(
           substrait$ReadRel$LocalFiles$FileOrFiles$create(
             uri_file = sprintf("file://%s", temp_parquet[i]),
             format = substrait$ReadRel$LocalFiles$FileOrFiles$FileFormat$
               FILE_FORMAT_PARQUET
           )
         )
       )
     })
     names(local_file_tables) <- names(tables)
   
     table_base_schema <- lapply(tables, as_substrait, "substrait.NamedStruct")
   
     call_for_errors <- sys.call()
   
     # walk the relation tree looking for named tables, replacing
     # with those from local_file_tables
     plan <- rel_tree_modify(plan, "substrait_ReadRel", function(x) {
       if (isTRUE("named_table" %in% names(x))) {
         name <- x$named_table$names
   
         if (!isTRUE(name %in% names(local_file_tables))) {
           rlang::abort(
             sprintf("Named table '%s' not found in `tables`", name),
             call = call_for_errors
           )
         }
   
         if (!identical(x$base_schema, table_base_schema[[name]])) {
           rlang::abort(
             sprintf(
               "Base schema for table '%s' does not match declared base schema",
               name
             ),
             call = call_for_errors
           )
         }
   
         x$named_table <- NULL
         x$local_files <- local_file_tables[[name]]
         x
       } else {
         x
       }
     })
   
     col_names <- substrait_colnames(plan$relations[[1]])
   
     # write parquet files
     Map(arrow::write_parquet, tables, temp_parquet)
   
     # run the exec plan
     getNamespace("arrow")[["do_exec_plan_substrait"]](as.raw(plan), col_names)
   }
   
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#discussion_r819786616



##########
File path: r/R/query-engine.R
##########
@@ -296,3 +296,39 @@ ExecNode <- R6Class("ExecNode",
     schema = function() ExecNode_output_schema(self)
   )
 )
+
+do_exec_plan_substrait <- function(.data, substrait_plan) {
+  if (is.string(substrait_plan)) {
+    substrait_plan <- engine__internal__SubstraitFromJSON(substrait_plan)
+  } else if (is.raw(substrait_plan)) {
+    substrait_plan <- buffer(substrait_plan)
+  } else {
+    abort("`substrait_plan` must be a JSON string or raw() vector")
+  }
+
+  plan <- ExecPlan$create()
+
+  if (inherits(.data, "RecordBatchReader")) {
+    source_node <- ExecNode_ReadFromRecordBatchReader(self, dataset$.data)
+  } else if (inehrits(.data, "ArrowTabular")) {
+    dataset <- InMemoryDataset$create(dataset)
+    source_node <- ExecNode_Scan(
+      plan,
+      dataset,
+      Expression$scalar(TRUE),
+      colnames %||% character(0)
+    )
+  } else if (inherits(.data, "Dataset")) {
+    source_node <- ExecNode_Scan(
+      plan,
+      .data,
+      Expression$scalar(TRUE),
+      colnames %||% character(0)
+    )
+  } else {
+    obj_desc <- paste0(class(.data), collapse = " / ")
+    abort(glue("Can't construct source node from object of type {obj_desc}"))
+  }

Review comment:
       The substrait consumer will not bring the pushdown from filter/project nodes into the read rel (though a query optimizer would ;))  In Substrait, you will need to do the same thing that you do with ExecPlans, only it will be ReadRel instead of ScanNode.  Scan predicate pushdown is expressed at `ReadRel::projection` and scan filter pushdown is specified at `ReadRel::filter`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#discussion_r819626624



##########
File path: r/R/query-engine.R
##########
@@ -296,3 +296,39 @@ ExecNode <- R6Class("ExecNode",
     schema = function() ExecNode_output_schema(self)
   )
 )
+
+do_exec_plan_substrait <- function(.data, substrait_plan) {
+  if (is.string(substrait_plan)) {
+    substrait_plan <- engine__internal__SubstraitFromJSON(substrait_plan)
+  } else if (is.raw(substrait_plan)) {
+    substrait_plan <- buffer(substrait_plan)
+  } else {
+    abort("`substrait_plan` must be a JSON string or raw() vector")
+  }
+
+  plan <- ExecPlan$create()
+
+  if (inherits(.data, "RecordBatchReader")) {
+    source_node <- ExecNode_ReadFromRecordBatchReader(self, dataset$.data)
+  } else if (inehrits(.data, "ArrowTabular")) {
+    dataset <- InMemoryDataset$create(dataset)
+    source_node <- ExecNode_Scan(
+      plan,
+      dataset,
+      Expression$scalar(TRUE),
+      colnames %||% character(0)
+    )
+  } else if (inherits(.data, "Dataset")) {
+    source_node <- ExecNode_Scan(
+      plan,
+      .data,
+      Expression$scalar(TRUE),
+      colnames %||% character(0)
+    )
+  } else {
+    obj_desc <- paste0(class(.data), collapse = " / ")
+    abort(glue("Can't construct source node from object of type {obj_desc}"))
+  }

Review comment:
       Why isn't this just `plan$Scan(.data)`?
   
   This is probably a bigger question, but apparently (with ExecPlans) you have to express the filter and projection into the Scan node in order to get predicate pushdown to work. That might not work when consuming substrait, the substrait consumer is going to need to handle the pushdown.
   
   Also you have a typo `inehrits`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot edited a comment on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
paleolimbot edited a comment on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1061817573


   Ok! This is ready for some eyes. The C++ that works with the ExecPlan is something that I got to work, but it is not elegant (notably, it requires that the output field names are known in advance). As @westonpace mentioned above, the C++ may be able to absorb the implementation for substrait -> RecordBatchReader, which is really what we want.
   
   I'm happy to put this PR on hold until there's more infrastructure to make this easier, although either way we will likely need several follow-up PRs. My preference would be to merge a preliminary version with an unexported `do_exec_plan_substrait()` so that Nic and I don't have to rebuild the R package every time we switch projects (but we can also make that work if that sounds too hacky).
   
   To make sure this PR is going to serve its purpose, I also implemented a preliminary substrait package function that calls `do_exec_plan_substrait()` (for reference below).
   
   <details>
   
   ```r
   
   substrait_eval_arrow <- function(plan, tables) {
     plan <- as_substrait(plan, "substrait.Plan")
     stopifnot(rlang::is_named2(tables))
   
     # only support plans with exactly one relation in the relations list for now
     stopifnot(length(plan$relations) == 1)
   
     temp_parquet <- vapply(tables, function(i) tempfile(), character(1))
     on.exit(unlink(temp_parquet))
   
     local_file_tables <- lapply(seq_along(tables), function(i) {
       substrait$ReadRel$LocalFiles$create(
         items = list(
           substrait$ReadRel$LocalFiles$FileOrFiles$create(
             uri_file = sprintf("file://%s", temp_parquet[i]),
             format = substrait$ReadRel$LocalFiles$FileOrFiles$FileFormat$
               FILE_FORMAT_PARQUET
           )
         )
       )
     })
     names(local_file_tables) <- names(tables)
   
     table_base_schema <- lapply(tables, as_substrait, "substrait.NamedStruct")
   
     call_for_errors <- sys.call()
   
     # walk the relation tree looking for named tables, replacing
     # with those from local_file_tables
     plan <- rel_tree_modify(plan, "substrait_ReadRel", function(x) {
       if (isTRUE("named_table" %in% names(x))) {
         name <- x$named_table$names
   
         if (!isTRUE(name %in% names(local_file_tables))) {
           rlang::abort(
             sprintf("Named table '%s' not found in `tables`", name),
             call = call_for_errors
           )
         }
   
         if (!identical(x$base_schema, table_base_schema[[name]])) {
           rlang::abort(
             sprintf(
               "Base schema for table '%s' does not match declared base schema",
               name
             ),
             call = call_for_errors
           )
         }
   
         x$named_table <- NULL
         x$local_files <- local_file_tables[[name]]
         x
       } else {
         x
       }
     })
   
     col_names <- substrait_colnames(plan$relations[[1]])
   
     # write parquet files
     Map(arrow::write_parquet, tables, temp_parquet)
   
     # run the exec plan
     getNamespace("arrow")[["do_exec_plan_substrait"]](as.raw(plan), col_names)
   }
   
   ```
   
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on a change in pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#discussion_r836570533



##########
File path: r/configure
##########
@@ -244,6 +244,14 @@ if [ $? -eq 0 ]; then
     # NOTE: arrow-dataset is assumed to have the same -L flag as arrow
     # so there is no need to add its location to PKG_DIRS
   fi
+  # Check for Arrow Engine subcomponent
+  grep -i 'set(ARROW_ENGINE "ON")' $ARROW_OPTS_CMAKE >/dev/null 2>&1
+  if [ $? -eq 0 ]; then
+    PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_ENGINE"
+    PKG_LIBS="-larrow_engine $PKG_LIBS"
+    # NOTE: arrow-engine is assumed to have the same -L flag as arrow
+    # so there is no need to add its location to PKG_DIRS
+  fi

Review comment:
       In `configure.win`, could we add the following to configure_dev()?
   
   ```
     if [ $(cmake_option ARROW_ENGINE) -eq 1 ]; then
       PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_ENGINE"
       PKG_CONFIG_PACKAGES="$PKG_CONFIG_PACKAGES arrow-engine"
     fi
   ```
   
   (I wouldn't worry about `configure_release`, since there are corresponding CI changes that would need to be handled there).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1081875949


   Benchmark runs are scheduled for baseline = 6f9b07a41b50a6d2db37a28912ad22148453fc54 and contender = 4d0436a04cd8e1b01b7f4ae6169973760088c702. 4d0436a04cd8e1b01b7f4ae6169973760088c702 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/965bb209cace477cbaf6601b7105107a...878fbf3c04894759a0ad3a75c1cc9138/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/8b4a561ed7314efa84bbcfc30d2f1146...f6ea13d3564246138d8a4f8a8aab35ca/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fbac1b8f329b4f5abadea48853f31f33...ec35388843d6486d903f9e34132c7603/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b6f49791a6174d5bb85050f728b9c494...59f9ad6d3c9c465fbec90895d64a51a3/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#discussion_r837386372



##########
File path: r/tests/testthat/test-query-engine.R
##########
@@ -0,0 +1,86 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+test_that("do_exec_plan_substrait can evaluate a simple plan", {
+  skip_if_not_available("engine")
+
+  df <- data.frame(i = 1:5, b = rep_len(c(TRUE, FALSE), 5))
+  table <- arrow_table(df, schema = schema(i = int64(), b = bool()))
+
+  tf <- tempfile()
+  on.exit(unlink(tf))
+  write_parquet(table, tf)
+
+  substrait_json <- sprintf('{
+    "relations": [
+      {"rel": {
+        "read": {
+          "base_schema": {
+            "struct": {
+              "types": [ {"i64": {}}, {"bool": {}} ]
+            },
+            "names": ["i", "b"]
+          },
+          "local_files": {
+            "items": [
+              {
+                "uri_file": "file://%s",
+                "format": "FILE_FORMAT_PARQUET"
+              }
+            ]
+          }
+        }
+      }}
+    ],
+    "extension_uris": [
+      {
+        "extension_uri_anchor": 7,
+        "uri": "https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml"

Review comment:
       That's good to know! I will leave it out for this particular test but we will need a way for the arrow package to communicate that string (among other things) to the substrait package at some point.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1081875949


   Benchmark runs are scheduled for baseline = 6f9b07a41b50a6d2db37a28912ad22148453fc54 and contender = 4d0436a04cd8e1b01b7f4ae6169973760088c702. 4d0436a04cd8e1b01b7f4ae6169973760088c702 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/965bb209cace477cbaf6601b7105107a...878fbf3c04894759a0ad3a75c1cc9138/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/8b4a561ed7314efa84bbcfc30d2f1146...f6ea13d3564246138d8a4f8a8aab35ca/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fbac1b8f329b4f5abadea48853f31f33...ec35388843d6486d903f9e34132c7603/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b6f49791a6174d5bb85050f728b9c494...59f9ad6d3c9c465fbec90895d64a51a3/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#discussion_r836507290



##########
File path: r/tests/testthat/test-query-engine.R
##########
@@ -0,0 +1,86 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+test_that("do_exec_plan_substrait can evaluate a simple plan", {
+  skip_if_not_available("engine")
+
+  df <- data.frame(i = 1:5, b = rep_len(c(TRUE, FALSE), 5))
+  table <- arrow_table(df, schema = schema(i = int64(), b = bool()))
+
+  tf <- tempfile()
+  on.exit(unlink(tf))
+  write_parquet(table, tf)
+
+  substrait_json <- sprintf('{
+    "relations": [
+      {"rel": {
+        "read": {
+          "base_schema": {
+            "struct": {
+              "types": [ {"i64": {}}, {"bool": {}} ]
+            },
+            "names": ["i", "b"]
+          },
+          "local_files": {
+            "items": [
+              {
+                "uri_file": "file://%s",
+                "format": "FILE_FORMAT_PARQUET"
+              }
+            ]
+          }
+        }
+      }}
+    ],
+    "extension_uris": [
+      {
+        "extension_uri_anchor": 7,
+        "uri": "https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml"

Review comment:
       This is more a substrait question, but does something like `https://github.com/apache/arrow/blob/-/format/substrait/extension_types.yaml` work (so that the default branch is used instead of hardcoding?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#discussion_r819791151



##########
File path: r/src/compute-exec.cpp
##########
@@ -277,4 +278,65 @@ std::shared_ptr<compute::ExecNode> ExecNode_ReadFromRecordBatchReader(
   return MakeExecNodeOrStop("source", plan.get(), {}, options);
 }
 
+// just until we figure out the incantation to get a proper sinknode

Review comment:
       ARROW-15849




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
jonkeane commented on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1059406392


   https://github.com/apache/arrow/pull/12569 also looks relevant for the build issues


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot edited a comment on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
paleolimbot edited a comment on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1061817573


   Ok! This is ready for some eyes. The C++ that works with the ExecPlan is something that I got to work, but it is not elegant (notably, it requires that the output field names are known in advance). As @westonpace mentioned above, the C++ may be able to absorb the implementation for substrait -> RecordBatchReader, which is really what we want.
   
   I'm happy to put this PR on hold until there's more infrastructure to make this easier, although either way we will likely need several follow-up PRs. My preference would be to merge a preliminary version with an unexported `do_exec_plan_substrait()` so that Nic and I don't have to rebuild the R package every time we switch projects (but we can also make that work if that sounds too hacky).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#discussion_r836968905



##########
File path: r/tests/testthat/test-query-engine.R
##########
@@ -0,0 +1,86 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+test_that("do_exec_plan_substrait can evaluate a simple plan", {
+  skip_if_not_available("engine")
+
+  df <- data.frame(i = 1:5, b = rep_len(c(TRUE, FALSE), 5))
+  table <- arrow_table(df, schema = schema(i = int64(), b = bool()))
+
+  tf <- tempfile()
+  on.exit(unlink(tf))
+  write_parquet(table, tf)
+
+  substrait_json <- sprintf('{
+    "relations": [
+      {"rel": {
+        "read": {
+          "base_schema": {
+            "struct": {
+              "types": [ {"i64": {}}, {"bool": {}} ]
+            },
+            "names": ["i", "b"]
+          },
+          "local_files": {
+            "items": [
+              {
+                "uri_file": "file://%s",
+                "format": "FILE_FORMAT_PARQUET"
+              }
+            ]
+          }
+        }
+      }}
+    ],
+    "extension_uris": [
+      {
+        "extension_uri_anchor": 7,
+        "uri": "https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml"

Review comment:
       Long term we will want this to be a hosted, versioned URL I think.  Something like `https://arrow.apache.org/v1.0.0/extension_types.yaml`.  Practically speaking this is just a namespace (XML does the same thing with xmlns) and so it is  important that producers and consumers agree on the exact same string.  It's useful, but not essential, that the URI is actually resolvable.

##########
File path: r/tests/testthat/test-query-engine.R
##########
@@ -0,0 +1,86 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+test_that("do_exec_plan_substrait can evaluate a simple plan", {
+  skip_if_not_available("engine")
+
+  df <- data.frame(i = 1:5, b = rep_len(c(TRUE, FALSE), 5))
+  table <- arrow_table(df, schema = schema(i = int64(), b = bool()))
+
+  tf <- tempfile()
+  on.exit(unlink(tf))
+  write_parquet(table, tf)
+
+  substrait_json <- sprintf('{
+    "relations": [
+      {"rel": {
+        "read": {
+          "base_schema": {
+            "struct": {
+              "types": [ {"i64": {}}, {"bool": {}} ]
+            },
+            "names": ["i", "b"]
+          },
+          "local_files": {
+            "items": [
+              {
+                "uri_file": "file://%s",
+                "format": "FILE_FORMAT_PARQUET"
+              }
+            ]
+          }
+        }
+      }}
+    ],
+    "extension_uris": [
+      {
+        "extension_uri_anchor": 7,
+        "uri": "https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml"

Review comment:
       Long term we will want this to be a hosted, versioned URL I think.  Something like `https://arrow.apache.org/v1.0.0/substrait/extension_types.yaml`.  Practically speaking this is just a namespace (XML does the same thing with xmlns) and so it is  important that producers and consumers agree on the exact same string.  It's useful, but not essential, that the URI is actually resolvable.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1081875949


   Benchmark runs are scheduled for baseline = 6f9b07a41b50a6d2db37a28912ad22148453fc54 and contender = 4d0436a04cd8e1b01b7f4ae6169973760088c702. 4d0436a04cd8e1b01b7f4ae6169973760088c702 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/965bb209cace477cbaf6601b7105107a...878fbf3c04894759a0ad3a75c1cc9138/)
   [Finished :arrow_down:0.13% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/8b4a561ed7314efa84bbcfc30d2f1146...f6ea13d3564246138d8a4f8a8aab35ca/)
   [Finished :arrow_down:0.71% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fbac1b8f329b4f5abadea48853f31f33...ec35388843d6486d903f9e34132c7603/)
   [Finished :arrow_down:0.17% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b6f49791a6174d5bb85050f728b9c494...59f9ad6d3c9c465fbec90895d64a51a3/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/407| `4d0436a0` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/393| `4d0436a0` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/393| `4d0436a0` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/403| `4d0436a0` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/406| `6f9b07a4` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/392| `6f9b07a4` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/392| `6f9b07a4` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/402| `6f9b07a4` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1081875949


   Benchmark runs are scheduled for baseline = 6f9b07a41b50a6d2db37a28912ad22148453fc54 and contender = 4d0436a04cd8e1b01b7f4ae6169973760088c702. 4d0436a04cd8e1b01b7f4ae6169973760088c702 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/965bb209cace477cbaf6601b7105107a...878fbf3c04894759a0ad3a75c1cc9138/)
   [Finished :arrow_down:0.13% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/8b4a561ed7314efa84bbcfc30d2f1146...f6ea13d3564246138d8a4f8a8aab35ca/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fbac1b8f329b4f5abadea48853f31f33...ec35388843d6486d903f9e34132c7603/)
   [Finished :arrow_down:0.17% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b6f49791a6174d5bb85050f728b9c494...59f9ad6d3c9c465fbec90895d64a51a3/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/407| `4d0436a0` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/393| `4d0436a0` test-mac-arm>
   [Scheduled] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/393| `4d0436a0` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/403| `4d0436a0` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/406| `6f9b07a4` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/392| `6f9b07a4` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/392| `6f9b07a4` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/402| `6f9b07a4` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1081875949


   Benchmark runs are scheduled for baseline = 6f9b07a41b50a6d2db37a28912ad22148453fc54 and contender = 4d0436a04cd8e1b01b7f4ae6169973760088c702. 4d0436a04cd8e1b01b7f4ae6169973760088c702 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/965bb209cace477cbaf6601b7105107a...878fbf3c04894759a0ad3a75c1cc9138/)
   [Finished :arrow_down:0.13% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/8b4a561ed7314efa84bbcfc30d2f1146...f6ea13d3564246138d8a4f8a8aab35ca/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fbac1b8f329b4f5abadea48853f31f33...ec35388843d6486d903f9e34132c7603/)
   [Finished :arrow_down:0.17% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b6f49791a6174d5bb85050f728b9c494...59f9ad6d3c9c465fbec90895d64a51a3/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1081875949


   Benchmark runs are scheduled for baseline = 6f9b07a41b50a6d2db37a28912ad22148453fc54 and contender = 4d0436a04cd8e1b01b7f4ae6169973760088c702. 4d0436a04cd8e1b01b7f4ae6169973760088c702 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/965bb209cace477cbaf6601b7105107a...878fbf3c04894759a0ad3a75c1cc9138/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/8b4a561ed7314efa84bbcfc30d2f1146...f6ea13d3564246138d8a4f8a8aab35ca/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fbac1b8f329b4f5abadea48853f31f33...ec35388843d6486d903f9e34132c7603/)
   [Finished :arrow_down:0.17% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b6f49791a6174d5bb85050f728b9c494...59f9ad6d3c9c465fbec90895d64a51a3/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane closed pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
jonkeane closed pull request #12564:
URL: https://github.com/apache/arrow/pull/12564


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1059208150






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#discussion_r819788630



##########
File path: r/src/compute-exec.cpp
##########
@@ -277,4 +278,65 @@ std::shared_ptr<compute::ExecNode> ExecNode_ReadFromRecordBatchReader(
   return MakeExecNodeOrStop("source", plan.get(), {}, options);
 }
 
+// just until we figure out the incantation to get a proper sinknode

Review comment:
       I think we would eventually want to do the same thing we do with exec plans which is to consume with pushgenerator (which is basically a producer/consumer queue).  The sink node consumer will push the batch into the queue and then a record batch reader will read the batches from the queue.  We'd also want to wire it up to backpressure like we have today.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#discussion_r819789166



##########
File path: r/src/compute-exec.cpp
##########
@@ -277,4 +278,65 @@ std::shared_ptr<compute::ExecNode> ExecNode_ReadFromRecordBatchReader(
   return MakeExecNodeOrStop("source", plan.get(), {}, options);
 }
 
+// just until we figure out the incantation to get a proper sinknode

Review comment:
       In fact, to reduce duplication, we may want to add a C++ method that takes in a Substrait plan, and returns a record batch reader.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#discussion_r836635722



##########
File path: r/tests/testthat/test-query-engine.R
##########
@@ -0,0 +1,86 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+test_that("do_exec_plan_substrait can evaluate a simple plan", {
+  skip_if_not_available("engine")
+
+  df <- data.frame(i = 1:5, b = rep_len(c(TRUE, FALSE), 5))
+  table <- arrow_table(df, schema = schema(i = int64(), b = bool()))
+
+  tf <- tempfile()
+  on.exit(unlink(tf))
+  write_parquet(table, tf)
+
+  substrait_json <- sprintf('{
+    "relations": [
+      {"rel": {
+        "read": {
+          "base_schema": {
+            "struct": {
+              "types": [ {"i64": {}}, {"bool": {}} ]
+            },
+            "names": ["i", "b"]
+          },
+          "local_files": {
+            "items": [
+              {
+                "uri_file": "file://%s",
+                "format": "FILE_FORMAT_PARQUET"
+              }
+            ]
+          }
+        }
+      }}
+    ],
+    "extension_uris": [
+      {
+        "extension_uri_anchor": 7,
+        "uri": "https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml"

Review comment:
       It's a good catch...it turns out the extension bits aren't needed at all for the test to pass, so I just removed them.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1081875949


   Benchmark runs are scheduled for baseline = 6f9b07a41b50a6d2db37a28912ad22148453fc54 and contender = 4d0436a04cd8e1b01b7f4ae6169973760088c702. 4d0436a04cd8e1b01b7f4ae6169973760088c702 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/965bb209cace477cbaf6601b7105107a...878fbf3c04894759a0ad3a75c1cc9138/)
   [Finished :arrow_down:0.13% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/8b4a561ed7314efa84bbcfc30d2f1146...f6ea13d3564246138d8a4f8a8aab35ca/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fbac1b8f329b4f5abadea48853f31f33...ec35388843d6486d903f9e34132c7603/)
   [Finished :arrow_down:0.17% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b6f49791a6174d5bb85050f728b9c494...59f9ad6d3c9c465fbec90895d64a51a3/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/407| `4d0436a0` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/393| `4d0436a0` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/403| `4d0436a0` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/406| `6f9b07a4` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/392| `6f9b07a4` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/402| `6f9b07a4` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1081875949


   Benchmark runs are scheduled for baseline = 6f9b07a41b50a6d2db37a28912ad22148453fc54 and contender = 4d0436a04cd8e1b01b7f4ae6169973760088c702. 4d0436a04cd8e1b01b7f4ae6169973760088c702 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Failed] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/965bb209cace477cbaf6601b7105107a...878fbf3c04894759a0ad3a75c1cc9138/)
   [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/8b4a561ed7314efa84bbcfc30d2f1146...f6ea13d3564246138d8a4f8a8aab35ca/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fbac1b8f329b4f5abadea48853f31f33...ec35388843d6486d903f9e34132c7603/)
   [Failed] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b6f49791a6174d5bb85050f728b9c494...59f9ad6d3c9c465fbec90895d64a51a3/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1081875949


   Benchmark runs are scheduled for baseline = 6f9b07a41b50a6d2db37a28912ad22148453fc54 and contender = 4d0436a04cd8e1b01b7f4ae6169973760088c702. 4d0436a04cd8e1b01b7f4ae6169973760088c702 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/965bb209cace477cbaf6601b7105107a...878fbf3c04894759a0ad3a75c1cc9138/)
   [Finished :arrow_down:0.13% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/8b4a561ed7314efa84bbcfc30d2f1146...f6ea13d3564246138d8a4f8a8aab35ca/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fbac1b8f329b4f5abadea48853f31f33...ec35388843d6486d903f9e34132c7603/)
   [Finished :arrow_down:0.17% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b6f49791a6174d5bb85050f728b9c494...59f9ad6d3c9c465fbec90895d64a51a3/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#discussion_r819650353



##########
File path: r/R/query-engine.R
##########
@@ -296,3 +296,39 @@ ExecNode <- R6Class("ExecNode",
     schema = function() ExecNode_output_schema(self)
   )
 )
+
+do_exec_plan_substrait <- function(.data, substrait_plan) {
+  if (is.string(substrait_plan)) {
+    substrait_plan <- engine__internal__SubstraitFromJSON(substrait_plan)
+  } else if (is.raw(substrait_plan)) {
+    substrait_plan <- buffer(substrait_plan)
+  } else {
+    abort("`substrait_plan` must be a JSON string or raw() vector")
+  }
+
+  plan <- ExecPlan$create()
+
+  if (inherits(.data, "RecordBatchReader")) {
+    source_node <- ExecNode_ReadFromRecordBatchReader(self, dataset$.data)
+  } else if (inehrits(.data, "ArrowTabular")) {
+    dataset <- InMemoryDataset$create(dataset)
+    source_node <- ExecNode_Scan(
+      plan,
+      dataset,
+      Expression$scalar(TRUE),
+      colnames %||% character(0)
+    )
+  } else if (inherits(.data, "Dataset")) {
+    source_node <- ExecNode_Scan(
+      plan,
+      .data,
+      Expression$scalar(TRUE),
+      colnames %||% character(0)
+    )
+  } else {
+    obj_desc <- paste0(class(.data), collapse = " / ")
+    abort(glue("Can't construct source node from object of type {obj_desc}"))
+  }

Review comment:
       My reading of it was that `plan$Scan(.data)` needed an `arrow_dplyr_query` and I was trying to keep the initial "this thing works" as simple as possible.
   
   I imagine we could do some inspection of the substrait plan to extract which field references actually show up (totally possible from what I currently know about substrait), or do that inspection the C++ consumer.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot edited a comment on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
paleolimbot edited a comment on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1061817573


   Ok! This is ready for some eyes. The C++ that works with the ExecPlan is something that I got to work, but it is not elegant (notably, it requires that the output field names are known in advance). As @westonpace mentioned above, the C++ may be able to absorb the implementation for substrait -> RecordBatchReader, which is really what we want.
   
   I'm happy to put this PR on hold until there's more infrastructure to make this easier, although either way we will likely need several follow-up PRs. My preference would be to merge a preliminary version with an unexported `do_exec_plan_substrait()` so that Nic and I don't have to rebuild the R package every time we switch projects (but we can also make that work if that sounds too hacky).
   
   To make sure this PR is going to serve its purpose, I also implemented a preliminary substrait package function that calls `do_exec_plan_substrait()` (for reference below).
   
   <details>
   
   substrait_eval_arrow <- function(plan, tables) {
     plan <- as_substrait(plan, "substrait.Plan")
     stopifnot(rlang::is_named2(tables))
   
     # only support plans with exactly one relation in the relations list for now
     stopifnot(length(plan$relations) == 1)
   
     temp_parquet <- vapply(tables, function(i) tempfile(), character(1))
     on.exit(unlink(temp_parquet))
   
     local_file_tables <- lapply(seq_along(tables), function(i) {
       substrait$ReadRel$LocalFiles$create(
         items = list(
           substrait$ReadRel$LocalFiles$FileOrFiles$create(
             uri_file = sprintf("file://%s", temp_parquet[i]),
             format = substrait$ReadRel$LocalFiles$FileOrFiles$FileFormat$
               FILE_FORMAT_PARQUET
           )
         )
       )
     })
     names(local_file_tables) <- names(tables)
   
     table_base_schema <- lapply(tables, as_substrait, "substrait.NamedStruct")
   
     call_for_errors <- sys.call()
   
     # walk the relation tree looking for named tables, replacing
     # with those from local_file_tables
     plan <- rel_tree_modify(plan, "substrait_ReadRel", function(x) {
       if (isTRUE("named_table" %in% names(x))) {
         name <- x$named_table$names
   
         if (!isTRUE(name %in% names(local_file_tables))) {
           rlang::abort(
             sprintf("Named table '%s' not found in `tables`", name),
             call = call_for_errors
           )
         }
   
         if (!identical(x$base_schema, table_base_schema[[name]])) {
           rlang::abort(
             sprintf(
               "Base schema for table '%s' does not match declared base schema",
               name
             ),
             call = call_for_errors
           )
         }
   
         x$named_table <- NULL
         x$local_files <- local_file_tables[[name]]
         x
       } else {
         x
       }
     })
   
     col_names <- substrait_colnames(plan$relations[[1]])
   
     # write parquet files
     Map(arrow::write_parquet, tables, temp_parquet)
   
     # run the exec plan
     getNamespace("arrow")[["do_exec_plan_substrait"]](as.raw(plan), col_names)
   }
   
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot edited a comment on pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
paleolimbot edited a comment on pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#issuecomment-1061817573


   Ok! This is ready for some eyes. The C++ that works with the ExecPlan is something that I got to work, but it is not elegant (notably, it requires that the output field names are known in advance). As @westonpace mentioned above, the C++ may be able to absorb the implementation for substrait -> RecordBatchReader, which is really what we want.
   
   I'm happy to put this PR on hold until there's more infrastructure to make this easier, although either way we will likely need several follow-up PRs. My preference would be to merge a preliminary version with an unexported `do_exec_plan_substrait()` so that Nic and I don't have to rebuild the R package every time we switch projects (but we can also make that work if that sounds too hacky).
   
   To make sure this PR is going to serve its purpose, I also implemented a preliminary substrait package function that calls `do_exec_plan_substrait()` (for reference below, or see https://github.com/voltrondata/substrait-r/pull/3 for details ).
   
   <details>
   
   ```r
   
   substrait_eval_arrow <- function(plan, tables) {
     plan <- as_substrait(plan, "substrait.Plan")
     stopifnot(rlang::is_named2(tables))
   
     # only support plans with exactly one relation in the relations list for now
     stopifnot(length(plan$relations) == 1)
   
     temp_parquet <- vapply(tables, function(i) tempfile(), character(1))
     on.exit(unlink(temp_parquet))
   
     local_file_tables <- lapply(seq_along(tables), function(i) {
       substrait$ReadRel$LocalFiles$create(
         items = list(
           substrait$ReadRel$LocalFiles$FileOrFiles$create(
             uri_file = sprintf("file://%s", temp_parquet[i]),
             format = substrait$ReadRel$LocalFiles$FileOrFiles$FileFormat$
               FILE_FORMAT_PARQUET
           )
         )
       )
     })
     names(local_file_tables) <- names(tables)
   
     table_base_schema <- lapply(tables, as_substrait, "substrait.NamedStruct")
   
     call_for_errors <- sys.call()
   
     # walk the relation tree looking for named tables, replacing
     # with those from local_file_tables
     plan <- rel_tree_modify(plan, "substrait_ReadRel", function(x) {
       if (isTRUE("named_table" %in% names(x))) {
         name <- x$named_table$names
   
         if (!isTRUE(name %in% names(local_file_tables))) {
           rlang::abort(
             sprintf("Named table '%s' not found in `tables`", name),
             call = call_for_errors
           )
         }
   
         if (!identical(x$base_schema, table_base_schema[[name]])) {
           rlang::abort(
             sprintf(
               "Base schema for table '%s' does not match declared base schema",
               name
             ),
             call = call_for_errors
           )
         }
   
         x$named_table <- NULL
         x$local_files <- local_file_tables[[name]]
         x
       } else {
         x
       }
     })
   
     col_names <- substrait_colnames(plan$relations[[1]])
   
     # write parquet files
     Map(arrow::write_parquet, tables, temp_parquet)
   
     # run the exec plan
     getNamespace("arrow")[["do_exec_plan_substrait"]](as.raw(plan), col_names)
   }
   
   ```
   
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #12564: ARROW-15818: [R] Implement initial Substrait consumer in the R bindings

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12564:
URL: https://github.com/apache/arrow/pull/12564#discussion_r836969275



##########
File path: r/tests/testthat/test-query-engine.R
##########
@@ -0,0 +1,86 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+test_that("do_exec_plan_substrait can evaluate a simple plan", {
+  skip_if_not_available("engine")
+
+  df <- data.frame(i = 1:5, b = rep_len(c(TRUE, FALSE), 5))
+  table <- arrow_table(df, schema = schema(i = int64(), b = bool()))
+
+  tf <- tempfile()
+  on.exit(unlink(tf))
+  write_parquet(table, tf)
+
+  substrait_json <- sprintf('{
+    "relations": [
+      {"rel": {
+        "read": {
+          "base_schema": {
+            "struct": {
+              "types": [ {"i64": {}}, {"bool": {}} ]
+            },
+            "names": ["i", "b"]
+          },
+          "local_files": {
+            "items": [
+              {
+                "uri_file": "file://%s",
+                "format": "FILE_FORMAT_PARQUET"
+              }
+            ]
+          }
+        }
+      }}
+    ],
+    "extension_uris": [
+      {
+        "extension_uri_anchor": 7,
+        "uri": "https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml"

Review comment:
       Or maybe `https://arrow.apache.org/substrait/v1.0.0/extension_types.yaml`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org