You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "davisusanibar (via GitHub)" <gi...@apache.org> on 2023/05/12 12:59:03 UTC

[GitHub] [arrow] davisusanibar opened a new pull request, #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

davisusanibar opened a new pull request, #35570:
URL: https://github.com/apache/arrow/pull/35570

   ### Rationale for this change
   
   To close https://github.com/apache/arrow/issues/34252
   
   ### What changes are included in this PR?
   
   This is a proposal to try to solve:
   1. Receive a list of Substrait scalar expressions and use them to Project a Dataset
   - [x] Draft a Substrait Extended Expression to test (this will be generated by 3rd party project such as Isthmus)
   - [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages
   - [x] Create JNI Wrapper for ScannerBuilder::Project 
   - [x] Create JNI API
   - [ ] Testing coverage
   - [ ] Documentation
   
   Current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. Not able to infer by column position by able to infer by colum name.
   
   This PR needs/use this PRs:
   - https://github.com/apache/arrow/pull/34834
   - https://github.com/apache/arrow/pull/34227
   
   2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset
   - [ ] Working to identify activities
   
   ### Are these changes tested?
   
   Initial unit test added.
   
   ### Are there any user-facing changes?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1318989817


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,173 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitFilter(Optional.of(substraitExpressionFilter))
+        .substraitProjection(Optional.empty())
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+        assertTrue(reader.getVectorSchemaRoot().getVector("id").toString().equals("[19, 1, 11]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("name").toString()
+            .equals("[value_19, value_1, value_11]"));
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilterException() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    // Expression 02: WHERE ID < 10
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW5" +
+        "5X2FueRISGhAIAhACGgpsdDphbnlfYW55GjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl9" +
+        "0aGFuXzIwGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKAoaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzEwIhoKAklECgROQU1F" +
+        "Eg4KBCoCEAEKBGICEAEYAg==";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitFilter(Optional.of(substraitExpressionFilter))
+        .substraitProjection(Optional.empty())
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish()
+    ) {
+      Exception e = assertThrows(RuntimeException.class, () -> dataset.newScan(options));
+      assertTrue(e.getMessage().startsWith("Only one filter expression may be provided"));
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProject() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("add_two_to_column_a", new ArrowType.Int(32, true)),
+        Field.nullable("concat_column_a_and_b", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Project New Column:
+    // Expression ADD: id + 2
+    // Expression CONCAT: name + '-' + name
+    String binarySubstraitExpressionProject = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" +
+        "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" +
+        "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" +
+        "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI=";
+    ByteBuffer substraitExpressionProject = getByteBuffer(binarySubstraitExpressionProject);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitProjection(Optional.of(substraitExpressionProject))
+        .substraitFilter(Optional.empty())
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        assertTrue(reader.getVectorSchemaRoot().getVector("add_two_to_column_a").toString()
+            .equals("[21, 3, 13, 23, 47]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("concat_column_a_and_b").toString()
+            .equals("[value_19 - value_19, value_1 - value_1, value_11 - value_11, " +
+                "value_21 - value_21, value_45 - value_45]"));
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(5, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProjectAndFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("add_two_to_column_a", new ArrowType.Int(32, true)),
+        Field.nullable("concat_column_a_and_b", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Project New Column:
+    // Expression ADD: id + 2
+    // Expression CONCAT: name + '-' + name
+    String binarySubstraitExpressionProject = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" +
+        "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" +
+        "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" +
+        "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI=";
+    ByteBuffer substraitExpressionProject = getByteBuffer(binarySubstraitExpressionProject);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitProjection(Optional.of(substraitExpressionProject))
+        .substraitFilter(Optional.of(substraitExpressionFilter))
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        assertTrue(reader.getVectorSchemaRoot().getVector("add_two_to_column_a").toString()
+            .equals("[21, 3, 13]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("concat_column_a_and_b").toString()
+            .equals("[value_19 - value_19, value_1 - value_1, value_11 - value_11]"));
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  private static ByteBuffer getByteBuffer(String base64EncodedSubstraitFilter) {

Review Comment:
   `base64EncodedSubstraitFilter` -> `base64EncodedSubstrait` since this can be both a filter or projection.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1323164894


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +74,77 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getSubstraitProjection() {
+    return substraitProjection;
+  }
+
+  public Optional<ByteBuffer> getSubstraitFilter() {
+    return substraitFilter;
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private Optional<String[]> columns;
+    private Optional<ByteBuffer> substraitProjection;
+    private Optional<ByteBuffer> substraitFilter;
+
+    /**
+     * Constructor.
+     * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+     */
+    public Builder(long batchSize) {
+      this.batchSize = batchSize;
+    }
+
+    /**
+     * Set the Projected columns. Empty for scanning all columns.
+     *
+     * @param columns Projected columns. Empty for scanning all columns.
+     * @return the ScanOptions configured.
+     */
+    public Builder columns(Optional<String[]> columns) {

Review Comment:
   One more thing: We don't need `Optional<>` parameters for the builder APIs. We should expect the user to passing us a valid object. Same with `substraitProjection` and `substraitFilter`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1218478190


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +206,132 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testDeserializeExtendedExpressions() {
+    // Extended Expression 01 (`add` `2` to column `id`): id + 2
+    // Extended Expression 02 (`concatenate` column `name` || column `name`): name || name
+    // Extended Expression 03 (`filter` 'id' < 20): id < 20
+    // Extended expression result: [add_two_to_column_a, add(FieldPath(0), 2),
+    // concat_column_a_and_b, binary_join_element_wise(FieldPath(1), FieldPath(1), ""),
+    // filter_one, (FieldPath(0) < 20)]
+    String binaryExtendedExpressions = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwKHggCEhovZnVuY3Rpb25zX2NvbXBhcmlz" +
+        "b24ueWFtbBIRGg8IARoLYWRkOmkzMl9pMzISFBoSCAIQARoMY29uY2F0OnZjaGFyEhIaEAgCEAIaCmx0OmFueV9hbnkaMQoaGhgaBCoCEAE" +
+        "iCBoGEgQKAhIAIgYaBAoCKAIaE2FkZF90d29fdG9fY29sdW1uX2EaOwoiGiAIARoEYgIQASIKGggSBgoEEgIIASIKGggSBgoEEgIIARoVY2" +
+        "9uY2F0X2NvbHVtbl9hX2FuZF9iGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzIwIhoKA" +
+        "klECgROQU1FEg4KBCoCEAEKBGICEAEYAg==";
+    // get binary plan
+    byte[] expression = Base64.getDecoder().decode(binaryExtendedExpressions);
+    ByteBuffer substraitExpression = ByteBuffer.allocateDirect(expression.length);
+    substraitExpression.put(expression);
+    // deserialize extended expression
+    List<String> extededExpressionList =
+        new AceroSubstraitConsumer(rootAllocator()).runDeserializeExpressions(substraitExpression);
+    assertEquals(3, extededExpressionList.size() / 2);
+    assertEquals("add_two_to_column_a", extededExpressionList.get(0));
+    assertEquals("add(FieldPath(0), 2)", extededExpressionList.get(1));
+    assertEquals("concat_column_a_and_b", extededExpressionList.get(2));
+    assertEquals("binary_join_element_wise(FieldPath(1), FieldPath(1), \"\")", extededExpressionList.get(3));
+    assertEquals("filter_id_lower_than_20", extededExpressionList.get(4));
+    assertEquals("(FieldPath(0) < 20)", extededExpressionList.get(5));
+  }

Review Comment:
   To be clear: I don't think we should have this API, period.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35570:
URL: https://github.com/apache/arrow/pull/35570#issuecomment-1568864696

   :warning: GitHub issue #34252 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1327264182


##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,349 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Dataset also supports projections and filters with Substrait's `Extended Expression`_.
+This requires the substrait-java library.
+
+This Java program:
+
+- Loads a Parquet file containing the "nation" table from the TPC-H benchmark.
+- Projects two new columns:
+    - ``N_NAME || ' - ' || N_COMMENT``
+    - ``N_REGIONKEY + 10``
+- Applies a filter: ``N_NATIONKEY > 18``
+
+.. code-block:: Java
+
+    import io.substrait.extension.ExtensionCollector;
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    public class ClientSubstraitExtendedExpressionsCookbook {
+        public static void main(String[] args) throws Exception {
+            // project and filter dataset using extended expression definition - 03 Expressions:
+            // Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3
+            // Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10
+            // Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18
+            projectAndFilterDataset();
+        }
+
+        public static void projectAndFilterDataset() {
+            String uri = "file:///Users/data/tpch_parquet/nation.parquet";
+            ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+                    .columns(Optional.empty())
+                    .substraitFilter(getSubstraitExpressionFilter())
+                    .substraitProjection(getSubstraitExpressionProjection())
+                    .build();
+            try (
+                    BufferAllocator allocator = new RootAllocator();
+                    DatasetFactory datasetFactory = new FileSystemDatasetFactory(
+                            allocator, NativeMemoryPool.getDefault(),
+                            FileFormat.PARQUET, uri);
+                    Dataset dataset = datasetFactory.finish();
+                    Scanner scanner = dataset.newScan(options);
+                    ArrowReader reader = scanner.scanBatches()
+            ) {
+                while (reader.loadNextBatch()) {
+                    System.out.println(
+                            reader.getVectorSchemaRoot().contentToTSVString());
+                }
+            } catch (Exception e) {
+                e.printStackTrace();
+            }
+        }
+
+        private static ByteBuffer getSubstraitExpressionProjection() {
+            // Expression: N_REGIONKEY + 10 = col 3 + 10
+            Expression.Builder selectionBuilderProjectOne = Expression.newBuilder().
+                    setSelection(
+                            Expression.FieldReference.newBuilder().
+                                    setDirectReference(
+                                            Expression.ReferenceSegment.newBuilder().
+                                                    setStructField(
+                                                            Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                                                    2)
+                                                    )
+                                    )
+                    );

Review Comment:
   changed
   



##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,349 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Dataset also supports projections and filters with Substrait's `Extended Expression`_.
+This requires the substrait-java library.
+
+This Java program:
+
+- Loads a Parquet file containing the "nation" table from the TPC-H benchmark.
+- Projects two new columns:
+    - ``N_NAME || ' - ' || N_COMMENT``
+    - ``N_REGIONKEY + 10``
+- Applies a filter: ``N_NATIONKEY > 18``
+
+.. code-block:: Java
+
+    import io.substrait.extension.ExtensionCollector;
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    public class ClientSubstraitExtendedExpressionsCookbook {
+        public static void main(String[] args) throws Exception {
+            // project and filter dataset using extended expression definition - 03 Expressions:
+            // Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3
+            // Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10
+            // Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18
+            projectAndFilterDataset();
+        }
+
+        public static void projectAndFilterDataset() {
+            String uri = "file:///Users/data/tpch_parquet/nation.parquet";
+            ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+                    .columns(Optional.empty())
+                    .substraitFilter(getSubstraitExpressionFilter())
+                    .substraitProjection(getSubstraitExpressionProjection())
+                    .build();
+            try (
+                    BufferAllocator allocator = new RootAllocator();
+                    DatasetFactory datasetFactory = new FileSystemDatasetFactory(
+                            allocator, NativeMemoryPool.getDefault(),
+                            FileFormat.PARQUET, uri);
+                    Dataset dataset = datasetFactory.finish();
+                    Scanner scanner = dataset.newScan(options);
+                    ArrowReader reader = scanner.scanBatches()
+            ) {
+                while (reader.loadNextBatch()) {
+                    System.out.println(
+                            reader.getVectorSchemaRoot().contentToTSVString());
+                }
+            } catch (Exception e) {
+                e.printStackTrace();
+            }

Review Comment:
   added



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -474,6 +484,39 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
     std::vector<std::string> column_vector = ToStringVector(env, columns);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (substrait_projection != nullptr) {
+    std::shared_ptr<arrow::Buffer> buffer = LoadArrowBufferFromByteBuffer(env,
+                                                            substrait_projection);
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    arrow::engine::BoundExpressions bounded_expression =
+          JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    for(arrow::engine::NamedExpression& named_expression :
+                                        bounded_expression.named_expressions) {
+      if (!(named_expression.expression.type()->id() == arrow::Type::BOOL)) {
+        project_exprs.push_back(std::move(named_expression.expression));
+        project_names.push_back(std::move(named_expression.name));
+      }
+    }
+    JniAssertOkOrThrow(scanner_builder->Project(std::move(project_exprs), std::move(project_names)));
+  }
+  if (substrait_filter != nullptr) {
+    std::shared_ptr<arrow::Buffer> buffer = LoadArrowBufferFromByteBuffer(env,
+                                                                substrait_filter);
+    std::optional<arrow::compute::Expression> filter_expr;
+    arrow::engine::BoundExpressions bounded_expression =
+          JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    for(arrow::engine::NamedExpression& named_expression :
+                                        bounded_expression.named_expressions) {
+      if (named_expression.expression.type()->id() == arrow::Type::BOOL) {
+        if (filter_expr.has_value()) {
+          JniThrow("Only one filter expression may be provided");
+        }
+        filter_expr = named_expression.expression;
+      }

Review Comment:
   added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1306106069


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +83,8 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public ByteBuffer getSubstraitExtendedExpression() {

Review Comment:
   Yes, that will be an option. But currently we could define using the same extended expression:
   - Or a projection
   - Or a filter
   - Or a projection + filter at the same time
   
   I could divide that, then, 03 new options/methods will be created, or 01 methods with 03 new types.... Does it make sense?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1304705797


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -49,24 +51,72 @@ public ScanOptions(String[] columns, long batchSize) {
   /**
    * Constructor.
    * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
-   * @param columns (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+   * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
    *                Only columns present in the Array will be scanned.
    */
-  public ScanOptions(long batchSize, Optional<String[]> columns) {
-    Preconditions.checkNotNull(columns);
+  public ScanOptions(long batchSize, Optional<String[]> columnsSubset) {
+    Preconditions.checkNotNull(columnsSubset);
     this.batchSize = batchSize;
-    this.columns = columns;
+    this.columnsSubset = columnsSubset;
+    this.columnsProduceOrFilter = Optional.empty();
   }
 
   public ScanOptions(long batchSize) {
     this(batchSize, Optional.empty());
   }
 
-  public Optional<String[]> getColumns() {
-    return columns;
+  public Optional<String[]> getColumnsSubset() {
+    return columnsSubset;
   }
 
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getColumnsProduceOrFilter() {
+    return columnsProduceOrFilter;
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private final Optional<String[]> columnsSubset;
+    private Optional<ByteBuffer> columnsProduceOrFilter = Optional.empty();
+
+    /**
+     * Constructor.
+     * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+     * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+     *                Only columns present in the Array will be scanned.
+     */
+    public Builder(long batchSize, Optional<String[]> columnsSubset) {
+      Preconditions.checkNotNull(columnsSubset);
+      this.batchSize = batchSize;
+      this.columnsSubset = columnsSubset;
+    }
+
+    /**
+     * Define binary extended expression message for projects new columns or applies filter.
+     *
+     * @param columnsProduceOrFilter (Optional) Expressions to evaluate to projects new columns or applies filter.
+     * @return the ScanOptions configured.
+     */
+    public Builder columnsProduceOrFilter(Optional<ByteBuffer> columnsProduceOrFilter) {

Review Comment:
   changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1317324936


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,170 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+        assertTrue(reader.getVectorSchemaRoot().getVector("id").toString().equals("[19, 1, 11]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("name").toString()
+            .equals("[value_19, value_1, value_11]"));

Review Comment:
   Is it possible to compare the actual vector instead of the string?



##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,167 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test(expected = RuntimeException.class)
+  public void testBaseParquetReadWithExtendedExpressionsFilterException() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    // Expression 02: WHERE ID < 10
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW5" +
+        "5X2FueRISGhAIAhACGgpsdDphbnlfYW55GjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl9" +
+        "0aGFuXzIwGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKAoaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzEwIhoKAklECgROQU1F" +
+        "Eg4KBCoCEAEKBGICEAEYAg==";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProject() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("add_two_to_column_a", new ArrowType.Int(32, true)),
+        Field.nullable("concat_column_a_and_b", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Project New Column:
+    // Expression ADD: id + 2
+    // Expression CONCAT: name + '-' + name
+    String binarySubstraitExpressionProject = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" +
+        "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" +
+        "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" +
+        "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionProject = Base64.getDecoder().decode(binarySubstraitExpressionProject);
+    ByteBuffer substraitExpressionProject = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionProject.length);
+    substraitExpressionProject.put(arrayByteSubstraitExpressionProject);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+         .substraitExpressionProjection(substraitExpressionProject)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(5, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProjectAndFilter() throws Exception {

Review Comment:
   I see you added a helper function for base64 encoded filters. I was actually thinking of putting most of the test code in this helper function though like this (pseudo code):
   
   ```
   private ArrowReader scanParquetFileUsingSubstrait(Optional<String> base64EncodedSubstraitFilter, Optional<String> base64EncodedSubstraitProjection) throws Exception {
       final Schema schema = new Schema(Arrays.asList(
           Field.nullable("id", new ArrowType.Int(32, true)),
           Field.nullable("name", new ArrowType.Utf8())
       ), null);
       ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
       ByteBuffer substraitExpressionProjection = getByteBuffer(base64EncodedSubstraitProjection);
       ParquetWriteSupport writeSupport = ParquetWriteSupport
           .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
               11, "value_11", 21, "value_21", 45, "value_45");
       ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
           .columns(Optional.empty())
           .substraitFilter(substraitExpressionFilter)
           .substraitProjection(substraitExpressionProjection)
           .build();
       try (
           DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
               FileFormat.PARQUET, writeSupport.getOutputURI());
           Dataset dataset = datasetFactory.finish();
           Scanner scanner = dataset.newScan(options);
           ArrowReader reader = scanner.scanBatches()
       ) {
         assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
         return reader;
   }
   
   
    @Test
     public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
       // Substrait Extended Expression: Filter:
       // Expression 01: WHERE ID < 20
       String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
           "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
           "BCgRiAhABGAI=";
       try (
           ArrowReader reader = scanParquetFileUsingSubstrait(base64EncodedSubstraitFilter, Optional.empty())
       ) {
         while (reader.loadNextBatch()) {
           assertTrue(reader.getVectorSchemaRoot().getVector("id").toString().equals("[19, 1, 11]"));
           assertTrue(reader.getVectorSchemaRoot().getVector("name").toString()
               .equals("[value_19, value_1, value_11]"));
         }
     }
   }
   
    @Test
   ...
   ```



##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,167 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);

Review Comment:
   Nice!
   
   nit: You can remove the rowcount assertion since you already check for exact values.



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +74,77 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getSubstraitProjection() {
+    return substraitProjection;
+  }
+
+  public Optional<ByteBuffer> getSubstraitFilter() {
+    return substraitFilter;
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private Optional<String[]> columns;
+    private ByteBuffer substraitProjection;
+    private ByteBuffer substraitFilter;

Review Comment:
   nit: could `Optional<ByteBuffer>` work here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1323252393


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +74,77 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getSubstraitProjection() {
+    return substraitProjection;
+  }
+
+  public Optional<ByteBuffer> getSubstraitFilter() {
+    return substraitFilter;
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private Optional<String[]> columns;
+    private Optional<ByteBuffer> substraitProjection;
+    private Optional<ByteBuffer> substraitFilter;
+
+    /**
+     * Constructor.
+     * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+     */
+    public Builder(long batchSize) {
+      this.batchSize = batchSize;
+    }
+
+    /**
+     * Set the Projected columns. Empty for scanning all columns.
+     *
+     * @param columns Projected columns. Empty for scanning all columns.
+     * @return the ScanOptions configured.
+     */
+    public Builder columns(Optional<String[]> columns) {

Review Comment:
   Ah I see. I assumed a user would prefer to only use the `columns` API when they want to project a subset of columns because, if left blank, the builder will build an empty `Optional<> columns` object automatically. I'm okay with leaving as-is. Thanks for the udpates!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on PR #35570:
URL: https://github.com/apache/arrow/pull/35570#issuecomment-1572171797

   HI @lidavidm, this PR enable Dataset Projections and Filters using Substrait Extended Expressions. This depends of another 02 PRs yet:
   - https://github.com/apache/arrow/pull/35798
   - https://github.com/apache/arrow/pull/34834
   
   Could be possible when you have some time to give some feedback for Java side implementation?
   
   Thank you in advance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1303368552


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -25,8 +26,9 @@
  * Options used during scanning.
  */
 public class ScanOptions {
-  private final Optional<String[]> columns;
+  private final Optional<String[]> columnsSubset;
   private final long batchSize;
+  private Optional<ByteBuffer> columnsProduceOrFilter;

Review Comment:
   The string parsing can and should be kept separate. The actual interface should reflect Substrait. (We could add a convenience method to set the filter via a string, which gets immediately parsed, for instance.)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1306123036


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +83,8 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public ByteBuffer getSubstraitExtendedExpression() {

Review Comment:
   ? You'd have `setProjection`, `setFilter`, and either or both could be null (and both are exclusive with `columns`)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1313527263


##########
docs/source/java/dataset.rst:
##########
@@ -159,6 +157,26 @@ Or use shortcut construtor:
 
 Then all columns will be emitted during scanning.
 
+Projection (Produce New Columns) and Filters
+============================================
+
+User can specify projections (new columns) or filters in ScanOptions. For example:
+
+.. code-block:: Java
+
+   ByteBuffer substraitExtendedExpressions = ...;

Review Comment:
   Added



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +74,77 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getSubstraitExpressionProjection() {
+    return substraitExpressionProjection;
+  }
+
+  public Optional<ByteBuffer> getSubstraitExpressionFilter() {
+    return substraitExpressionFilter;
+  }

Review Comment:
   Changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1313211229


##########
docs/source/java/dataset.rst:
##########
@@ -159,6 +157,26 @@ Or use shortcut construtor:
 
 Then all columns will be emitted during scanning.
 
+Projection (Produce New Columns) and Filters
+============================================
+
+User can specify projections (new columns) or filters in ScanOptions. For example:

Review Comment:
   ```suggestion
   User can specify projections (new columns) or filters in ScanOptions using Substrait. For example:
   ```



##########
docs/source/java/dataset.rst:
##########
@@ -159,6 +157,26 @@ Or use shortcut construtor:
 
 Then all columns will be emitted during scanning.
 
+Projection (Produce New Columns) and Filters
+============================================
+
+User can specify projections (new columns) or filters in ScanOptions. For example:
+
+.. code-block:: Java
+
+   ByteBuffer substraitExtendedExpressions = ...;

Review Comment:
   We need two `ByteBuffer`s in the example now, one for filter and one for projection. They should be passed into the options builder below (instead of the getSubstraitExpressionX() methods).



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +74,77 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getSubstraitExpressionProjection() {
+    return substraitExpressionProjection;
+  }
+
+  public Optional<ByteBuffer> getSubstraitExpressionFilter() {
+    return substraitExpressionFilter;
+  }

Review Comment:
   Can we call these `substrait_projection` and `substrait_filter`? I think we can either leave out the word "expression" or else change it "susbstrait_extended_expression_X" if we want to be verbose. I'm curious if other folks have thoughts on readability. Substrait will probably be a new concept to many Arrow Java users so I think it would be good have consistent and clear naming here.
   
   If we change the naming, it would be best to change it everywhere e.g. in JNI/C++, too.



##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,167 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test(expected = RuntimeException.class)
+  public void testBaseParquetReadWithExtendedExpressionsFilterException() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    // Expression 02: WHERE ID < 10
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW5" +
+        "5X2FueRISGhAIAhACGgpsdDphbnlfYW55GjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl9" +
+        "0aGFuXzIwGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKAoaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzEwIhoKAklECgROQU1F" +
+        "Eg4KBCoCEAEKBGICEAEYAg==";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }

Review Comment:
   Can we delete the code that should not run after the expected Exception is thrown?



##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,167 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test(expected = RuntimeException.class)
+  public void testBaseParquetReadWithExtendedExpressionsFilterException() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    // Expression 02: WHERE ID < 10
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW5" +
+        "5X2FueRISGhAIAhACGgpsdDphbnlfYW55GjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl9" +
+        "0aGFuXzIwGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKAoaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzEwIhoKAklECgROQU1F" +
+        "Eg4KBCoCEAEKBGICEAEYAg==";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProject() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("add_two_to_column_a", new ArrowType.Int(32, true)),
+        Field.nullable("concat_column_a_and_b", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Project New Column:
+    // Expression ADD: id + 2
+    // Expression CONCAT: name + '-' + name
+    String binarySubstraitExpressionProject = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" +
+        "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" +
+        "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" +
+        "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionProject = Base64.getDecoder().decode(binarySubstraitExpressionProject);
+    ByteBuffer substraitExpressionProject = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionProject.length);
+    substraitExpressionProject.put(arrayByteSubstraitExpressionProject);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+         .substraitExpressionProjection(substraitExpressionProject)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(5, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProjectAndFilter() throws Exception {

Review Comment:
   Optional: There's a decent amount of duplicated code in the test cases. Do you think it would be possible to create a generic function that can be parameterized for each test case?
   
   maybe something like this?
   ```
   private ArrowReader scanParquetFileUsingSubstrait(String base64EncodedSubstraitFilter, String base64EncodedSubstraitProjection) throws Exception {...}
   ```
   



##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,167 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +

Review Comment:
   Nit: Would it be more descriptive to call this `base64EncodedSubstraitFilter`? Then it would get decoded into a var called `substraitFilter`. I usually don't like to get too picky with naming, but in this case it would be nice to be highly descriptive of what these random-looking strings are. 



##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,167 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);

Review Comment:
   Optional: Would it be easy to replace the final assertion in these test cases with an exact match of values instead of just row count? If not, then ignore this comment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1313246033


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,167 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +

Review Comment:
   Nit: Would it be more descriptive to call this `base64EncodedSubstraitFilter`? Then it would get decoded into a var called `substraitFilter`. I usually don't like to get too picky with naming in tests, but in this case it would be nice to be highly descriptive of what these random-looking strings are. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1327794631


##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,350 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Dataset also supports projections and filters with Substrait's `Extended Expression`_.
+This requires the substrait-java library.
+
+This Java program:
+
+- Loads a Parquet file containing the "nation" table from the TPC-H benchmark.
+- Projects two new columns:
+    - ``N_NAME || ' - ' || N_COMMENT``
+    - ``N_REGIONKEY + 10``
+- Applies a filter: ``N_NATIONKEY > 18``
+
+.. code-block:: Java
+
+    import io.substrait.extension.ExtensionCollector;
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    public class ClientSubstraitExtendedExpressionsCookbook {
+
+      public static void main(String[] args) throws Exception {
+        // project and filter dataset using extended expression definition - 03 Expressions:
+        // Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3
+        // Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10
+        // Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18
+        projectAndFilterDataset();
+      }
+
+      public static void projectAndFilterDataset() {
+        //String uri = "file:///Users/dsusanibar/data/tpch_parquet/nation.parquet";
+        String uri = "file:////Users/dsusanibar/voltron/fork/consumer-testing/tests/data/tpch_parquet/nation.parquet";

Review Comment:
   Just updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1214373161


##########
docs/source/java/dataset.rst:
##########
@@ -158,6 +156,21 @@ Or use shortcut construtor:
 
 Then all columns will be emitted during scanning.
 
+Projection (Produce New Columns) and Filters
+============================================
+
+User can specify projections (new columns) or filters in ScanOptions. For example:
+
+.. code-block:: Java
+
+   ByteBuffer substraitExtendedExpressions = ...; // createExtendedExpresionMessageUsingSubstraitPOJOClasses

Review Comment:
   remove the comment? Or write it out as a sentence on a new line (`// Use Substrait APIs to create an Expression and serialize to a ByteBuffer`)



##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,335 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Using `Extended Expression`_ we could leverage our current Dataset operations to
+also support Projections and Filters by. To gain access to Projections and Filters
+is needed to define that operations using current Extended Expression Java POJO
+classes defined into `Substrait Java`_ project.

Review Comment:
   ```suggestion
   Dataset also supports projections and filters with Substrait's extended expressions.
   This requires the substrait-java library.
   ```



##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,335 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Using `Extended Expression`_ we could leverage our current Dataset operations to
+also support Projections and Filters by. To gain access to Projections and Filters
+is needed to define that operations using current Extended Expression Java POJO
+classes defined into `Substrait Java`_ project.
+
+Here is an example of a Java program that queries a Parquet file to project new
+columns and also filter then based on Extended Expression definitions. This example
+show us:
+
+- Load TPCH parquet file Nation.parquet.
+- Produce new Projections and apply Filter into dataset using extended expression definition.
+    - Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3.
+    - Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10.
+    - Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18.
+
+.. code-block:: Java
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import com.google.protobuf.InvalidProtocolBufferException;
+    import com.google.protobuf.util.JsonFormat;
+
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+
+    public class ClientSubstraitExtendedExpressions {
+      public static void main(String[] args) throws Exception {
+        // create extended expression for: project two new columns + one filter
+        String binaryExtendedExpressions = createExtendedExpresionMessageUsingPOJOClasses();

Review Comment:
   Um, why use a String? Just pass it around as a ByteBuffer in the first place.



##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,335 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Using `Extended Expression`_ we could leverage our current Dataset operations to
+also support Projections and Filters by. To gain access to Projections and Filters
+is needed to define that operations using current Extended Expression Java POJO
+classes defined into `Substrait Java`_ project.
+
+Here is an example of a Java program that queries a Parquet file to project new
+columns and also filter then based on Extended Expression definitions. This example
+show us:
+
+- Load TPCH parquet file Nation.parquet.
+- Produce new Projections and apply Filter into dataset using extended expression definition.
+    - Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3.
+    - Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10.
+    - Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18.
+
+.. code-block:: Java
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import com.google.protobuf.InvalidProtocolBufferException;
+    import com.google.protobuf.util.JsonFormat;
+
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+
+    public class ClientSubstraitExtendedExpressions {
+      public static void main(String[] args) throws Exception {
+        // create extended expression for: project two new columns + one filter
+        String binaryExtendedExpressions = createExtendedExpresionMessageUsingPOJOClasses();
+        // project and filter dataset using extended expression definition - 03 Expressions:
+        // Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3
+        // Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10
+        // Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18
+        projectAndFilterDataset(binaryExtendedExpressions);
+      }
+
+      public static void projectAndFilterDataset(String binaryExtendedExpressions) {
+        String uri = "file:///data/tpch_parquet/nation.parquet";
+        byte[] extendedExpressions = Base64.getDecoder().decode(
+            binaryExtendedExpressions);
+        ByteBuffer substraitExtendedExpressions = ByteBuffer.allocateDirect(
+            extendedExpressions.length);
+        substraitExtendedExpressions.put(extendedExpressions);
+        ScanOptions options = new ScanOptions(/*batchSize*/ 32768,
+            Optional.empty(),
+            Optional.of(substraitExtendedExpressions));
+        try (
+            BufferAllocator allocator = new RootAllocator();
+            DatasetFactory datasetFactory = new FileSystemDatasetFactory(
+                allocator, NativeMemoryPool.getDefault(),
+                FileFormat.PARQUET, uri);
+            Dataset dataset = datasetFactory.finish();
+            Scanner scanner = dataset.newScan(options);
+            ArrowReader reader = scanner.scanBatches()
+        ) {
+          while (reader.loadNextBatch()) {
+            System.out.println(
+                reader.getVectorSchemaRoot().contentToTSVString());
+          }
+        } catch (Exception e) {
+          e.printStackTrace();
+        }
+      }
+
+      private static String createExtendedExpresionMessageUsingPOJOClasses() throws InvalidProtocolBufferException {
+        // Expression: N_REGIONKEY + 10 = col 3 + 10
+        Expression.Builder selectionBuilderProjectOne = Expression.newBuilder().
+            setSelection(
+                Expression.FieldReference.newBuilder().
+                    setDirectReference(
+                        Expression.ReferenceSegment.newBuilder().
+                            setStructField(
+                                Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                    2)
+                            )
+                    )
+            );
+        Expression.Builder literalBuilderProjectOne = Expression.newBuilder()
+            .setLiteral(
+                Expression.Literal.newBuilder().setI32(10)
+            );
+        io.substrait.proto.Type outputProjectOne = TypeCreator.NULLABLE.I32.accept(
+            new TypeProtoConverter());
+        Expression.Builder expressionBuilderProjectOne = Expression.
+            newBuilder().
+            setScalarFunction(
+                Expression.
+                    ScalarFunction.
+                    newBuilder().
+                    setFunctionReference(0).
+                    setOutputType(outputProjectOne).
+                    addArguments(
+                        0,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderProjectOne)
+                    ).
+                    addArguments(
+                        1,
+                        FunctionArgument.newBuilder().setValue(
+                            literalBuilderProjectOne)
+                    )
+            );
+        ExpressionReference.Builder expressionReferenceBuilderProjectOne = ExpressionReference.newBuilder().
+            setExpression(expressionBuilderProjectOne)
+            .addOutputNames("ADD_TEN_TO_COLUMN_N_REGIONKEY");
+
+        // Expression: name || name = N_NAME || "-" || N_COMMENT = col 1 || col 3
+        Expression.Builder selectionBuilderProjectTwo = Expression.newBuilder().
+            setSelection(
+                Expression.FieldReference.newBuilder().
+                    setDirectReference(
+                        Expression.ReferenceSegment.newBuilder().
+                            setStructField(
+                                Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                    1)
+                            )
+                    )
+            );
+        Expression.Builder selectionBuilderProjectTwoConcatLiteral = Expression.newBuilder()
+            .setLiteral(
+                Expression.Literal.newBuilder().setString(" - ")
+            );
+        Expression.Builder selectionBuilderProjectOneToConcat = Expression.newBuilder().
+            setSelection(
+                Expression.FieldReference.newBuilder().
+                    setDirectReference(
+                        Expression.ReferenceSegment.newBuilder().
+                            setStructField(
+                                Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                    3)
+                            )
+                    )
+            );
+        io.substrait.proto.Type outputProjectTwo = TypeCreator.NULLABLE.STRING.accept(
+            new TypeProtoConverter());
+        Expression.Builder expressionBuilderProjectTwo = Expression.
+            newBuilder().
+            setScalarFunction(
+                Expression.
+                    ScalarFunction.
+                    newBuilder().
+                    setFunctionReference(1).
+                    setOutputType(outputProjectTwo).
+                    addArguments(
+                        0,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderProjectTwo)
+                    ).
+                    addArguments(
+                        1,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderProjectTwoConcatLiteral)
+                    ).
+                    addArguments(
+                        2,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderProjectOneToConcat)
+                    )
+            );
+        ExpressionReference.Builder expressionReferenceBuilderProjectTwo = ExpressionReference.newBuilder().
+            setExpression(expressionBuilderProjectTwo)
+            .addOutputNames("CONCAT_COLUMNS_N_NAME_AND_N_COMMENT");
+
+        // Expression: Filter: N_NATIONKEY > 18 = col 1 > 18
+        Expression.Builder selectionBuilderFilterOne = Expression.newBuilder().
+            setSelection(
+                Expression.FieldReference.newBuilder().
+                    setDirectReference(
+                        Expression.ReferenceSegment.newBuilder().
+                            setStructField(
+                                Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                    0)
+                            )
+                    )
+            );
+        Expression.Builder literalBuilderFilterOne = Expression.newBuilder()
+            .setLiteral(
+                Expression.Literal.newBuilder().setI32(18)
+            );
+        io.substrait.proto.Type outputFilterOne = TypeCreator.NULLABLE.BOOLEAN.accept(
+            new TypeProtoConverter());
+        Expression.Builder expressionBuilderFilterOne = Expression.
+            newBuilder().
+            setScalarFunction(
+                Expression.
+                    ScalarFunction.
+                    newBuilder().
+                    setFunctionReference(2).
+                    setOutputType(outputFilterOne).
+                    addArguments(
+                        0,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderFilterOne)
+                    ).
+                    addArguments(
+                        1,
+                        FunctionArgument.newBuilder().setValue(
+                            literalBuilderFilterOne)
+                    )
+            );
+        ExpressionReference.Builder expressionReferenceBuilderFilterOne = ExpressionReference.newBuilder().
+            setExpression(expressionBuilderFilterOne)
+            .addOutputNames("COLUMN_N_NATIONKEY_GREATER_THAN_18");
+
+        List<String> columnNames = Arrays.asList("N_NATIONKEY", "N_NAME",
+            "N_REGIONKEY", "N_COMMENT");
+        List<Type> dataTypes = Arrays.asList(
+            TypeCreator.NULLABLE.I32,
+            TypeCreator.NULLABLE.STRING,
+            TypeCreator.NULLABLE.I32,
+            TypeCreator.NULLABLE.STRING
+        );
+        //
+        NamedStruct of = NamedStruct.of(
+            columnNames,
+            Type.Struct.builder().fields(dataTypes).nullable(false).build()
+        );
+
+        // Extensions URI
+        HashMap<String, SimpleExtensionURI> extensionUris = new HashMap<>();
+        extensionUris.put(
+            "key-001",
+            SimpleExtensionURI.newBuilder()
+                .setExtensionUriAnchor(1)
+                .setUri("/functions_arithmetic.yaml")
+                .build()
+        );
+        extensionUris.put(
+            "key-002",
+            SimpleExtensionURI.newBuilder()
+                .setExtensionUriAnchor(2)
+                .setUri("/functions_comparison.yaml")
+                .build()
+        );
+
+        // Extensions
+        ArrayList<SimpleExtensionDeclaration> extensions = new ArrayList<>();
+        SimpleExtensionDeclaration extensionFunctionAdd = SimpleExtensionDeclaration.newBuilder()
+            .setExtensionFunction(
+                SimpleExtensionDeclaration.ExtensionFunction.newBuilder()
+                    .setFunctionAnchor(0)
+                    .setName("add:i32_i32")
+                    .setExtensionUriReference(1))
+            .build();
+        SimpleExtensionDeclaration extensionFunctionGreaterThan = SimpleExtensionDeclaration.newBuilder()
+            .setExtensionFunction(
+                SimpleExtensionDeclaration.ExtensionFunction.newBuilder()
+                    .setFunctionAnchor(1)
+                    .setName("concat:vchar")
+                    .setExtensionUriReference(2))
+            .build();
+        SimpleExtensionDeclaration extensionFunctionLowerThan = SimpleExtensionDeclaration.newBuilder()
+            .setExtensionFunction(
+                SimpleExtensionDeclaration.ExtensionFunction.newBuilder()
+                    .setFunctionAnchor(2)
+                    .setName("gt:any_any")
+                    .setExtensionUriReference(2))
+            .build();
+        extensions.add(extensionFunctionAdd);
+        extensions.add(extensionFunctionGreaterThan);
+        extensions.add(extensionFunctionLowerThan);
+
+        // Extended Expression
+        ExtendedExpression.Builder extendedExpressionBuilder =
+            ExtendedExpression.newBuilder().
+                addReferredExpr(0,
+                    expressionReferenceBuilderProjectOne).
+                addReferredExpr(1,
+                    expressionReferenceBuilderProjectTwo).
+                addReferredExpr(2,
+                    expressionReferenceBuilderFilterOne).
+                setBaseSchema(of.toProto());
+        extendedExpressionBuilder.addAllExtensionUris(extensionUris.values());
+        extendedExpressionBuilder.addAllExtensions(extensions);
+
+        ExtendedExpression extendedExpression = extendedExpressionBuilder.build();

Review Comment:
   How stable is this? If the format is likely to change then I wonder if it's worth having so much code that might get stale quickly



##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,335 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Using `Extended Expression`_ we could leverage our current Dataset operations to
+also support Projections and Filters by. To gain access to Projections and Filters
+is needed to define that operations using current Extended Expression Java POJO
+classes defined into `Substrait Java`_ project.
+
+Here is an example of a Java program that queries a Parquet file to project new
+columns and also filter then based on Extended Expression definitions. This example
+show us:
+
+- Load TPCH parquet file Nation.parquet.
+- Produce new Projections and apply Filter into dataset using extended expression definition.
+    - Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3.
+    - Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10.
+    - Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18.
+
+.. code-block:: Java
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import com.google.protobuf.InvalidProtocolBufferException;
+    import com.google.protobuf.util.JsonFormat;
+
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+
+    public class ClientSubstraitExtendedExpressions {
+      public static void main(String[] args) throws Exception {
+        // create extended expression for: project two new columns + one filter
+        String binaryExtendedExpressions = createExtendedExpresionMessageUsingPOJOClasses();
+        // project and filter dataset using extended expression definition - 03 Expressions:
+        // Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3
+        // Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10
+        // Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18
+        projectAndFilterDataset(binaryExtendedExpressions);
+      }
+
+      public static void projectAndFilterDataset(String binaryExtendedExpressions) {
+        String uri = "file:///data/tpch_parquet/nation.parquet";
+        byte[] extendedExpressions = Base64.getDecoder().decode(
+            binaryExtendedExpressions);
+        ByteBuffer substraitExtendedExpressions = ByteBuffer.allocateDirect(
+            extendedExpressions.length);
+        substraitExtendedExpressions.put(extendedExpressions);
+        ScanOptions options = new ScanOptions(/*batchSize*/ 32768,
+            Optional.empty(),
+            Optional.of(substraitExtendedExpressions));
+        try (
+            BufferAllocator allocator = new RootAllocator();
+            DatasetFactory datasetFactory = new FileSystemDatasetFactory(
+                allocator, NativeMemoryPool.getDefault(),
+                FileFormat.PARQUET, uri);
+            Dataset dataset = datasetFactory.finish();
+            Scanner scanner = dataset.newScan(options);
+            ArrowReader reader = scanner.scanBatches()
+        ) {
+          while (reader.loadNextBatch()) {
+            System.out.println(
+                reader.getVectorSchemaRoot().contentToTSVString());
+          }
+        } catch (Exception e) {
+          e.printStackTrace();
+        }
+      }
+
+      private static String createExtendedExpresionMessageUsingPOJOClasses() throws InvalidProtocolBufferException {
+        // Expression: N_REGIONKEY + 10 = col 3 + 10
+        Expression.Builder selectionBuilderProjectOne = Expression.newBuilder().
+            setSelection(
+                Expression.FieldReference.newBuilder().
+                    setDirectReference(
+                        Expression.ReferenceSegment.newBuilder().
+                            setStructField(
+                                Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                    2)
+                            )
+                    )
+            );
+        Expression.Builder literalBuilderProjectOne = Expression.newBuilder()
+            .setLiteral(
+                Expression.Literal.newBuilder().setI32(10)
+            );
+        io.substrait.proto.Type outputProjectOne = TypeCreator.NULLABLE.I32.accept(
+            new TypeProtoConverter());
+        Expression.Builder expressionBuilderProjectOne = Expression.
+            newBuilder().
+            setScalarFunction(
+                Expression.
+                    ScalarFunction.
+                    newBuilder().
+                    setFunctionReference(0).
+                    setOutputType(outputProjectOne).
+                    addArguments(
+                        0,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderProjectOne)
+                    ).
+                    addArguments(
+                        1,
+                        FunctionArgument.newBuilder().setValue(
+                            literalBuilderProjectOne)
+                    )
+            );
+        ExpressionReference.Builder expressionReferenceBuilderProjectOne = ExpressionReference.newBuilder().
+            setExpression(expressionBuilderProjectOne)
+            .addOutputNames("ADD_TEN_TO_COLUMN_N_REGIONKEY");
+
+        // Expression: name || name = N_NAME || "-" || N_COMMENT = col 1 || col 3
+        Expression.Builder selectionBuilderProjectTwo = Expression.newBuilder().
+            setSelection(
+                Expression.FieldReference.newBuilder().
+                    setDirectReference(
+                        Expression.ReferenceSegment.newBuilder().
+                            setStructField(
+                                Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                    1)
+                            )
+                    )
+            );
+        Expression.Builder selectionBuilderProjectTwoConcatLiteral = Expression.newBuilder()
+            .setLiteral(
+                Expression.Literal.newBuilder().setString(" - ")
+            );
+        Expression.Builder selectionBuilderProjectOneToConcat = Expression.newBuilder().
+            setSelection(
+                Expression.FieldReference.newBuilder().
+                    setDirectReference(
+                        Expression.ReferenceSegment.newBuilder().
+                            setStructField(
+                                Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                    3)
+                            )
+                    )
+            );
+        io.substrait.proto.Type outputProjectTwo = TypeCreator.NULLABLE.STRING.accept(
+            new TypeProtoConverter());
+        Expression.Builder expressionBuilderProjectTwo = Expression.
+            newBuilder().
+            setScalarFunction(
+                Expression.
+                    ScalarFunction.
+                    newBuilder().
+                    setFunctionReference(1).
+                    setOutputType(outputProjectTwo).
+                    addArguments(
+                        0,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderProjectTwo)
+                    ).
+                    addArguments(
+                        1,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderProjectTwoConcatLiteral)
+                    ).
+                    addArguments(
+                        2,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderProjectOneToConcat)
+                    )
+            );
+        ExpressionReference.Builder expressionReferenceBuilderProjectTwo = ExpressionReference.newBuilder().
+            setExpression(expressionBuilderProjectTwo)
+            .addOutputNames("CONCAT_COLUMNS_N_NAME_AND_N_COMMENT");
+
+        // Expression: Filter: N_NATIONKEY > 18 = col 1 > 18
+        Expression.Builder selectionBuilderFilterOne = Expression.newBuilder().
+            setSelection(
+                Expression.FieldReference.newBuilder().
+                    setDirectReference(
+                        Expression.ReferenceSegment.newBuilder().
+                            setStructField(
+                                Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                    0)
+                            )
+                    )
+            );
+        Expression.Builder literalBuilderFilterOne = Expression.newBuilder()
+            .setLiteral(
+                Expression.Literal.newBuilder().setI32(18)
+            );
+        io.substrait.proto.Type outputFilterOne = TypeCreator.NULLABLE.BOOLEAN.accept(
+            new TypeProtoConverter());
+        Expression.Builder expressionBuilderFilterOne = Expression.
+            newBuilder().
+            setScalarFunction(
+                Expression.
+                    ScalarFunction.
+                    newBuilder().
+                    setFunctionReference(2).
+                    setOutputType(outputFilterOne).
+                    addArguments(
+                        0,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderFilterOne)
+                    ).
+                    addArguments(
+                        1,
+                        FunctionArgument.newBuilder().setValue(
+                            literalBuilderFilterOne)
+                    )
+            );
+        ExpressionReference.Builder expressionReferenceBuilderFilterOne = ExpressionReference.newBuilder().
+            setExpression(expressionBuilderFilterOne)
+            .addOutputNames("COLUMN_N_NATIONKEY_GREATER_THAN_18");
+
+        List<String> columnNames = Arrays.asList("N_NATIONKEY", "N_NAME",
+            "N_REGIONKEY", "N_COMMENT");
+        List<Type> dataTypes = Arrays.asList(
+            TypeCreator.NULLABLE.I32,
+            TypeCreator.NULLABLE.STRING,
+            TypeCreator.NULLABLE.I32,
+            TypeCreator.NULLABLE.STRING
+        );
+        //
+        NamedStruct of = NamedStruct.of(
+            columnNames,
+            Type.Struct.builder().fields(dataTypes).nullable(false).build()
+        );
+
+        // Extensions URI
+        HashMap<String, SimpleExtensionURI> extensionUris = new HashMap<>();
+        extensionUris.put(
+            "key-001",
+            SimpleExtensionURI.newBuilder()
+                .setExtensionUriAnchor(1)
+                .setUri("/functions_arithmetic.yaml")
+                .build()
+        );
+        extensionUris.put(
+            "key-002",
+            SimpleExtensionURI.newBuilder()
+                .setExtensionUriAnchor(2)
+                .setUri("/functions_comparison.yaml")
+                .build()
+        );
+
+        // Extensions
+        ArrayList<SimpleExtensionDeclaration> extensions = new ArrayList<>();
+        SimpleExtensionDeclaration extensionFunctionAdd = SimpleExtensionDeclaration.newBuilder()
+            .setExtensionFunction(
+                SimpleExtensionDeclaration.ExtensionFunction.newBuilder()
+                    .setFunctionAnchor(0)
+                    .setName("add:i32_i32")
+                    .setExtensionUriReference(1))
+            .build();
+        SimpleExtensionDeclaration extensionFunctionGreaterThan = SimpleExtensionDeclaration.newBuilder()
+            .setExtensionFunction(
+                SimpleExtensionDeclaration.ExtensionFunction.newBuilder()
+                    .setFunctionAnchor(1)
+                    .setName("concat:vchar")
+                    .setExtensionUriReference(2))
+            .build();
+        SimpleExtensionDeclaration extensionFunctionLowerThan = SimpleExtensionDeclaration.newBuilder()
+            .setExtensionFunction(
+                SimpleExtensionDeclaration.ExtensionFunction.newBuilder()
+                    .setFunctionAnchor(2)
+                    .setName("gt:any_any")
+                    .setExtensionUriReference(2))
+            .build();
+        extensions.add(extensionFunctionAdd);
+        extensions.add(extensionFunctionGreaterThan);
+        extensions.add(extensionFunctionLowerThan);
+
+        // Extended Expression
+        ExtendedExpression.Builder extendedExpressionBuilder =
+            ExtendedExpression.newBuilder().
+                addReferredExpr(0,
+                    expressionReferenceBuilderProjectOne).
+                addReferredExpr(1,
+                    expressionReferenceBuilderProjectTwo).
+                addReferredExpr(2,
+                    expressionReferenceBuilderFilterOne).
+                setBaseSchema(of.toProto());
+        extendedExpressionBuilder.addAllExtensionUris(extensionUris.values());
+        extendedExpressionBuilder.addAllExtensions(extensions);
+
+        ExtendedExpression extendedExpression = extendedExpressionBuilder.build();
+
+        // Print JSON
+        System.out.println(
+            JsonFormat.printer().includingDefaultValueFields().print(
+                extendedExpression));
+        // Print binary representation
+        System.out.println(Base64.getEncoder().encodeToString(
+            extendedExpression.toByteArray()));

Review Comment:
   ...there's not really much point printing this



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -696,3 +722,44 @@ JNIEXPORT void JNICALL
   JniAssertOkOrThrow(arrow::ExportRecordBatchReader(reader_out, arrow_stream_out));
   JNI_METHOD_END()
 }
+
+/*
+ * Class:     org_apache_arrow_dataset_substrait_JniWrapper
+ * Method:    executeDeserializeExpressions
+ * Signature: (Ljava/nio/ByteBuffer;)[Ljava/lang/String;
+ */
+JNIEXPORT jobjectArray JNICALL
+    Java_org_apache_arrow_dataset_substrait_JniWrapper_executeDeserializeExpressions (
+    JNIEnv* env, jobject, jobject expression) {
+  JNI_METHOD_START
+  auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(expression));
+  int length = env->GetDirectBufferCapacity(expression);
+  std::shared_ptr<arrow::Buffer> buffer = JniGetOrThrow(arrow::AllocateBuffer(length));
+  std::memcpy(buffer->mutable_data(), buff, length);
+  // execute expression
+      arrow::engine::BoundExpressions round_tripped =
+    JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+  // validate is not empty!
+  // create response
+  int totalExpression = round_tripped.named_expressions.size();
+  jobjectArray extendedExpressionOutput = (jobjectArray)env->NewObjectArray(totalExpression*2,env->FindClass("java/lang/String"),0);
+  int i; int j = 0;
+  for (i=0; i<totalExpression; i++) {
+    env->SetObjectArrayElement(
+      extendedExpressionOutput,
+      j++,
+      env->NewStringUTF(
+        round_tripped.named_expressions[i].name.c_str()
+      )
+    );
+    env->SetObjectArrayElement(
+      extendedExpressionOutput,
+      j++,
+      env->NewStringUTF(
+        round_tripped.named_expressions[i].expression.ToString().c_str()
+      )
+    );
+  }

Review Comment:
   ```suggestion
     jobjectArray extendedExpressionOutput = (jobjectArray)env->NewObjectArray(totalExpression*2,env->FindClass("java/lang/String"),0);
     int j = 0;
     for (const auto& expression : round_tripped.named_expressions) {
       env->SetObjectArrayElement(
         extendedExpressionOutput,
         j++,
         env->NewStringUTF(expression.name.c_str())
       );
       env->SetObjectArrayElement(
         extendedExpressionOutput,
         j++,
         env->NewStringUTF(expression.expression.ToString().c_str())
       );
     }
   ```



##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +206,132 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testDeserializeExtendedExpressions() {
+    // Extended Expression 01 (`add` `2` to column `id`): id + 2
+    // Extended Expression 02 (`concatenate` column `name` || column `name`): name || name
+    // Extended Expression 03 (`filter` 'id' < 20): id < 20
+    // Extended expression result: [add_two_to_column_a, add(FieldPath(0), 2),
+    // concat_column_a_and_b, binary_join_element_wise(FieldPath(1), FieldPath(1), ""),
+    // filter_one, (FieldPath(0) < 20)]
+    String binaryExtendedExpressions = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwKHggCEhovZnVuY3Rpb25zX2NvbXBhcmlz" +
+        "b24ueWFtbBIRGg8IARoLYWRkOmkzMl9pMzISFBoSCAIQARoMY29uY2F0OnZjaGFyEhIaEAgCEAIaCmx0OmFueV9hbnkaMQoaGhgaBCoCEAE" +
+        "iCBoGEgQKAhIAIgYaBAoCKAIaE2FkZF90d29fdG9fY29sdW1uX2EaOwoiGiAIARoEYgIQASIKGggSBgoEEgIIASIKGggSBgoEEgIIARoVY2" +
+        "9uY2F0X2NvbHVtbl9hX2FuZF9iGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzIwIhoKA" +
+        "klECgROQU1FEg4KBCoCEAEKBGICEAEYAg==";
+    // get binary plan
+    byte[] expression = Base64.getDecoder().decode(binaryExtendedExpressions);
+    ByteBuffer substraitExpression = ByteBuffer.allocateDirect(expression.length);
+    substraitExpression.put(expression);
+    // deserialize extended expression
+    List<String> extededExpressionList =
+        new AceroSubstraitConsumer(rootAllocator()).runDeserializeExpressions(substraitExpression);
+    assertEquals(3, extededExpressionList.size() / 2);
+    assertEquals("add_two_to_column_a", extededExpressionList.get(0));
+    assertEquals("add(FieldPath(0), 2)", extededExpressionList.get(1));
+    assertEquals("concat_column_a_and_b", extededExpressionList.get(2));
+    assertEquals("binary_join_element_wise(FieldPath(1), FieldPath(1), \"\")", extededExpressionList.get(3));
+    assertEquals("filter_id_lower_than_20", extededExpressionList.get(4));
+    assertEquals("(FieldPath(0) < 20)", extededExpressionList.get(5));
+  }

Review Comment:
   I don't think this test is useful. We're just testing the C++ code without any purpose.



##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,335 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Using `Extended Expression`_ we could leverage our current Dataset operations to
+also support Projections and Filters by. To gain access to Projections and Filters
+is needed to define that operations using current Extended Expression Java POJO
+classes defined into `Substrait Java`_ project.
+
+Here is an example of a Java program that queries a Parquet file to project new
+columns and also filter then based on Extended Expression definitions. This example
+show us:
+
+- Load TPCH parquet file Nation.parquet.
+- Produce new Projections and apply Filter into dataset using extended expression definition.
+    - Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3.
+    - Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10.
+    - Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18.
+
+.. code-block:: Java
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import com.google.protobuf.InvalidProtocolBufferException;
+    import com.google.protobuf.util.JsonFormat;
+
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+
+    public class ClientSubstraitExtendedExpressions {
+      public static void main(String[] args) throws Exception {
+        // create extended expression for: project two new columns + one filter
+        String binaryExtendedExpressions = createExtendedExpresionMessageUsingPOJOClasses();
+        // project and filter dataset using extended expression definition - 03 Expressions:
+        // Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3
+        // Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10
+        // Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18

Review Comment:
   ```suggestion
   ```



##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,335 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Using `Extended Expression`_ we could leverage our current Dataset operations to
+also support Projections and Filters by. To gain access to Projections and Filters
+is needed to define that operations using current Extended Expression Java POJO
+classes defined into `Substrait Java`_ project.
+
+Here is an example of a Java program that queries a Parquet file to project new
+columns and also filter then based on Extended Expression definitions. This example
+show us:
+
+- Load TPCH parquet file Nation.parquet.
+- Produce new Projections and apply Filter into dataset using extended expression definition.
+    - Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3.
+    - Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10.
+    - Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18.
+
+.. code-block:: Java
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import com.google.protobuf.InvalidProtocolBufferException;
+    import com.google.protobuf.util.JsonFormat;
+
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+
+    public class ClientSubstraitExtendedExpressions {
+      public static void main(String[] args) throws Exception {
+        // create extended expression for: project two new columns + one filter

Review Comment:
   ```suggestion
   ```



##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,335 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Using `Extended Expression`_ we could leverage our current Dataset operations to
+also support Projections and Filters by. To gain access to Projections and Filters
+is needed to define that operations using current Extended Expression Java POJO
+classes defined into `Substrait Java`_ project.
+
+Here is an example of a Java program that queries a Parquet file to project new
+columns and also filter then based on Extended Expression definitions. This example
+show us:
+
+- Load TPCH parquet file Nation.parquet.
+- Produce new Projections and apply Filter into dataset using extended expression definition.
+    - Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3.
+    - Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10.
+    - Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18.

Review Comment:
   ```suggestion
   This Java program:
   
   - Loads a Parquet file containing the "nation" table from the TPC-H benchmark.
   - Projects two new columns:
       - ``N_NAME || ' - ' || N_COMMENT``
       - ``N_REGIONKEY + 10``
   - Applies a filter: ``N_NATIONKEY > 18``
   ```



##########
java/dataset/src/main/java/org/apache/arrow/dataset/substrait/AceroSubstraitConsumer.java:
##########
@@ -90,6 +91,15 @@ public ArrowReader runQuery(ByteBuffer plan, Map<String, ArrowReader> namedTable
     return execute(plan, namedTables);
   }
 
+  public List<String> runDeserializeExpressions(ByteBuffer plan) {

Review Comment:
   Docstrings? I don't think we want these at all though? 



##########
java/dataset/src/test/java/org/apache/arrow/dataset/TestDataset.java:
##########
@@ -79,6 +79,7 @@ protected List<ArrowRecordBatch> collectTaskData(Scanner scan) {
       List<ArrowRecordBatch> batches = new ArrayList<>();
       while (reader.loadNextBatch()) {
         VectorSchemaRoot root = reader.getVectorSchemaRoot();
+        System.out.println(root.getSchema());

Review Comment:
   ```suggestion
   ```



##########
java/dataset/src/main/java/org/apache/arrow/dataset/jni/NativeDataset.java:
##########
@@ -36,12 +36,14 @@ public NativeDataset(NativeContext context, long datasetId) {
   }
 
   @Override
+  @SuppressWarnings("ArrayToString")

Review Comment:
   What are we suppressing here?



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -56,10 +58,26 @@ public ScanOptions(long batchSize, Optional<String[]> columns) {
     Preconditions.checkNotNull(columns);
     this.batchSize = batchSize;
     this.columns = columns;
+    this.projectExpression = Optional.empty();
+  }
+
+  /**
+   * Constructor.
+   * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+   * @param columns (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+   *                Only columns present in the Array will be scanned.
+   * @param projectExpression (Optional) Expressions to evaluate to produce columns
+   */
+  public ScanOptions(long batchSize, Optional<String[]> columns, Optional<ByteBuffer> projectExpression) {

Review Comment:
   We should just use a builder at this point...



##########
java/dataset/src/main/java/org/apache/arrow/dataset/jni/JniWrapper.java:
##########
@@ -67,14 +69,16 @@ private JniWrapper() {
   /**
    * Create Scanner from a Dataset and get the native pointer of the Dataset.
    * @param datasetId the native pointer of the arrow::dataset::Dataset instance.
-   * @param columns desired column names.
-   *                Columns not in this list will not be emitted when performing scan operation. Null equals
-   *                to "all columns".
+   * @param columnsSubset desired column names. Columns not in this list will not be emitted when performing scan
+   *                      operation. Null equals to "all columns".
+   * @param columnsToProduce Set expressions which will be evaluated to produce the materialized columns. Null equals
+   *                         to "no produce".

Review Comment:
   ```suggestion
      * @param columnsToProduce Expressions to materialize new columns (if desired).
   ```



##########
java/dataset/src/main/java/org/apache/arrow/dataset/substrait/JniWrapper.java:
##########
@@ -69,5 +69,8 @@ public native void executeSerializedPlan(String planInput, String[] mapTableToMe
    * @param memoryAddressOutput the memory address where RecordBatchReader is exported.
    */
   public native void executeSerializedPlan(ByteBuffer planInput, String[] mapTableToMemoryAddressInput,
-                                                      long memoryAddressOutput);
+                                           long memoryAddressOutput);
+
+  // add description

Review Comment:
   This needs to be done?



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -696,3 +722,44 @@ JNIEXPORT void JNICALL
   JniAssertOkOrThrow(arrow::ExportRecordBatchReader(reader_out, arrow_stream_out));
   JNI_METHOD_END()
 }
+
+/*
+ * Class:     org_apache_arrow_dataset_substrait_JniWrapper
+ * Method:    executeDeserializeExpressions
+ * Signature: (Ljava/nio/ByteBuffer;)[Ljava/lang/String;
+ */
+JNIEXPORT jobjectArray JNICALL
+    Java_org_apache_arrow_dataset_substrait_JniWrapper_executeDeserializeExpressions (
+    JNIEnv* env, jobject, jobject expression) {
+  JNI_METHOD_START
+  auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(expression));
+  int length = env->GetDirectBufferCapacity(expression);
+  std::shared_ptr<arrow::Buffer> buffer = JniGetOrThrow(arrow::AllocateBuffer(length));
+  std::memcpy(buffer->mutable_data(), buff, length);
+  // execute expression
+      arrow::engine::BoundExpressions round_tripped =

Review Comment:
   C++ code needs to be formatted



##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +206,132 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testDeserializeExtendedExpressions() {
+    // Extended Expression 01 (`add` `2` to column `id`): id + 2
+    // Extended Expression 02 (`concatenate` column `name` || column `name`): name || name
+    // Extended Expression 03 (`filter` 'id' < 20): id < 20
+    // Extended expression result: [add_two_to_column_a, add(FieldPath(0), 2),
+    // concat_column_a_and_b, binary_join_element_wise(FieldPath(1), FieldPath(1), ""),
+    // filter_one, (FieldPath(0) < 20)]
+    String binaryExtendedExpressions = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwKHggCEhovZnVuY3Rpb25zX2NvbXBhcmlz" +
+        "b24ueWFtbBIRGg8IARoLYWRkOmkzMl9pMzISFBoSCAIQARoMY29uY2F0OnZjaGFyEhIaEAgCEAIaCmx0OmFueV9hbnkaMQoaGhgaBCoCEAE" +
+        "iCBoGEgQKAhIAIgYaBAoCKAIaE2FkZF90d29fdG9fY29sdW1uX2EaOwoiGiAIARoEYgIQASIKGggSBgoEEgIIASIKGggSBgoEEgIIARoVY2" +
+        "9uY2F0X2NvbHVtbl9hX2FuZF9iGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzIwIhoKA" +
+        "klECgROQU1FEg4KBCoCEAEKBGICEAEYAg==";
+    // get binary plan
+    byte[] expression = Base64.getDecoder().decode(binaryExtendedExpressions);
+    ByteBuffer substraitExpression = ByteBuffer.allocateDirect(expression.length);
+    substraitExpression.put(expression);
+    // deserialize extended expression
+    List<String> extededExpressionList =
+        new AceroSubstraitConsumer(rootAllocator()).runDeserializeExpressions(substraitExpression);
+    assertEquals(3, extededExpressionList.size() / 2);
+    assertEquals("add_two_to_column_a", extededExpressionList.get(0));
+    assertEquals("add(FieldPath(0), 2)", extededExpressionList.get(1));
+    assertEquals("concat_column_a_and_b", extededExpressionList.get(2));
+    assertEquals("binary_join_element_wise(FieldPath(1), FieldPath(1), \"\")", extededExpressionList.get(3));
+    assertEquals("filter_id_lower_than_20", extededExpressionList.get(4));
+    assertEquals("(FieldPath(0) < 20)", extededExpressionList.get(5));
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProjectAndFilter() throws Exception {
+    // Extended Expression 01 (`add` `2` to column `id`): id + 2
+    // Extended Expression 02 (`concatenate` column `name` || column `name`): name || name
+    // Extended Expression 03 (`filter` 'id' < 20): id < 20
+    // Extended expression result: [add_two_to_column_a, add(FieldPath(0), 2),
+    // concat_column_a_and_b, binary_join_element_wise(FieldPath(1), FieldPath(1), ""),
+    // filter_one, (FieldPath(0) < 20)]
+    // Base64.getEncoder().encodeToString(plan.toByteArray()): Generated throughout Substrait POJO Extended Expressions
+    String binaryExtendedExpressions = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwKHggCEhovZnVuY3Rpb25zX2NvbXBhcmlz" +
+        "b24ueWFtbBIRGg8IARoLYWRkOmkzMl9pMzISFBoSCAIQARoMY29uY2F0OnZjaGFyEhIaEAgCEAIaCmx0OmFueV9hbnkaMQoaGhgaBCoCEAE" +
+        "iCBoGEgQKAhIAIgYaBAoCKAIaE2FkZF90d29fdG9fY29sdW1uX2EaOwoiGiAIARoEYgIQASIKGggSBgoEEgIIASIKGggSBgoEEgIIARoVY2" +
+        "9uY2F0X2NvbHVtbl9hX2FuZF9iGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzIwIhoKA" +
+        "klECgROQU1FEg4KBCoCEAEKBGICEAEYAg==";
+    Map<String, String> metadataSchema = new HashMap<>();
+    metadataSchema.put("parquet.avro.schema", "{\"type\":\"record\",\"name\":\"Users\"," +
+        "\"namespace\":\"org.apache.arrow.dataset\",\"fields\":[{\"name\":\"id\"," +
+        "\"type\":[\"int\",\"null\"]},{\"name\":\"name\",\"type\":[\"string\",\"null\"]}]}");
+    metadataSchema.put("writer.model.name", "avro");

Review Comment:
   Why do we need this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1303366332


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -25,8 +26,9 @@
  * Options used during scanning.
  */
 public class ScanOptions {
-  private final Optional<String[]> columns;
+  private final Optional<String[]> columnsSubset;
   private final long batchSize;
+  private Optional<ByteBuffer> columnsProduceOrFilter;

Review Comment:
   Hey @lidavidm, what's your take on this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1304702747


##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,323 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Dataset also supports projections and filters with Substrait's extended expressions.
+This requires the substrait-java library.
+
+This Java program:
+
+- Loads a Parquet file containing the "nation" table from the TPC-H benchmark.
+- Projects two new columns:
+    - ``N_NAME || ' - ' || N_COMMENT``
+    - ``N_REGIONKEY + 10``
+- Applies a filter: ``N_NATIONKEY > 18``
+
+.. code-block:: Java
+
+    import com.google.protobuf.InvalidProtocolBufferException;
+    import io.substrait.extension.ExtensionCollector;
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    public class ClientSubstraitExtendedExpressions {
+        public static void main(String[] args) throws Exception {
+            // create extended expression for: project two new columns + one filter
+            ByteBuffer binaryExtendedExpressions = createExtendedExpresionMessageUsingPOJOClasses();
+            // project and filter dataset using extended expression definition - 03 Expressions:
+            // Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3
+            // Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10
+            // Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18
+            projectAndFilterDataset(binaryExtendedExpressions);
+        }
+
+        public static void projectAndFilterDataset(ByteBuffer binaryExtendedExpressions) {
+            String uri = "file:////Users/dsusanibar/voltron/fork/consumer-testing/tests/data/tpch_parquet/nation.parquet";

Review Comment:
   changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1304703302


##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -470,12 +471,37 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
       JniGetOrThrow(dataset->NewScan());
   JniAssertOkOrThrow(scanner_builder->Pool(pool));
-  if (columns != nullptr) {
-    std::vector<std::string> column_vector = ToStringVector(env, columns);
+  if (columns_subset != nullptr) {
+    std::vector<std::string> column_vector = ToStringVector(env, columns_subset);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (columns_to_produce_or_filter != nullptr) {
+    auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(columns_to_produce_or_filter));

Review Comment:
   changed



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -470,12 +471,37 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
       JniGetOrThrow(dataset->NewScan());
   JniAssertOkOrThrow(scanner_builder->Pool(pool));
-  if (columns != nullptr) {
-    std::vector<std::string> column_vector = ToStringVector(env, columns);
+  if (columns_subset != nullptr) {
+    std::vector<std::string> column_vector = ToStringVector(env, columns_subset);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (columns_to_produce_or_filter != nullptr) {
+    auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(columns_to_produce_or_filter));
+    int length = env->GetDirectBufferCapacity(columns_to_produce_or_filter);
+    std::shared_ptr<arrow::Buffer> buffer = JniGetOrThrow(arrow::AllocateBuffer(length));
+    std::memcpy(buffer->mutable_data(), buff, length);

Review Comment:
   added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1317996146


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +74,77 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getSubstraitProjection() {
+    return substraitProjection;
+  }
+
+  public Optional<ByteBuffer> getSubstraitFilter() {
+    return substraitFilter;
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private Optional<String[]> columns;
+    private ByteBuffer substraitProjection;
+    private ByteBuffer substraitFilter;

Review Comment:
   Added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35570:
URL: https://github.com/apache/arrow/pull/35570#issuecomment-1710616183

   I added one last small comment about the tests, but otherwise LGTM! I would approve it if I had the capability.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1319208514


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,173 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitFilter(Optional.of(substraitExpressionFilter))
+        .substraitProjection(Optional.empty())
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+        assertTrue(reader.getVectorSchemaRoot().getVector("id").toString().equals("[19, 1, 11]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("name").toString()
+            .equals("[value_19, value_1, value_11]"));
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilterException() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    // Expression 02: WHERE ID < 10
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW5" +
+        "5X2FueRISGhAIAhACGgpsdDphbnlfYW55GjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl9" +
+        "0aGFuXzIwGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKAoaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzEwIhoKAklECgROQU1F" +
+        "Eg4KBCoCEAEKBGICEAEYAg==";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitFilter(Optional.of(substraitExpressionFilter))
+        .substraitProjection(Optional.empty())
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish()
+    ) {
+      Exception e = assertThrows(RuntimeException.class, () -> dataset.newScan(options));
+      assertTrue(e.getMessage().startsWith("Only one filter expression may be provided"));
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProject() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("add_two_to_column_a", new ArrowType.Int(32, true)),
+        Field.nullable("concat_column_a_and_b", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Project New Column:
+    // Expression ADD: id + 2
+    // Expression CONCAT: name + '-' + name
+    String binarySubstraitExpressionProject = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" +
+        "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" +
+        "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" +
+        "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI=";
+    ByteBuffer substraitExpressionProject = getByteBuffer(binarySubstraitExpressionProject);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitProjection(Optional.of(substraitExpressionProject))
+        .substraitFilter(Optional.empty())
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        assertTrue(reader.getVectorSchemaRoot().getVector("add_two_to_column_a").toString()
+            .equals("[21, 3, 13, 23, 47]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("concat_column_a_and_b").toString()
+            .equals("[value_19 - value_19, value_1 - value_1, value_11 - value_11, " +
+                "value_21 - value_21, value_45 - value_45]"));
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(5, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProjectAndFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("add_two_to_column_a", new ArrowType.Int(32, true)),
+        Field.nullable("concat_column_a_and_b", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Project New Column:
+    // Expression ADD: id + 2
+    // Expression CONCAT: name + '-' + name
+    String binarySubstraitExpressionProject = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" +
+        "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" +
+        "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" +
+        "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI=";
+    ByteBuffer substraitExpressionProject = getByteBuffer(binarySubstraitExpressionProject);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitProjection(Optional.of(substraitExpressionProject))
+        .substraitFilter(Optional.of(substraitExpressionFilter))
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        assertTrue(reader.getVectorSchemaRoot().getVector("add_two_to_column_a").toString()
+            .equals("[21, 3, 13]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("concat_column_a_and_b").toString()
+            .equals("[value_19 - value_19, value_1 - value_1, value_11 - value_11]"));
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  private static ByteBuffer getByteBuffer(String base64EncodedSubstrait) {
+    byte[] substraitFilter = Base64.getDecoder().decode(base64EncodedSubstrait);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(substraitFilter.length);

Review Comment:
   Changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1313230005


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +74,77 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getSubstraitExpressionProjection() {
+    return substraitExpressionProjection;
+  }
+
+  public Optional<ByteBuffer> getSubstraitExpressionFilter() {
+    return substraitExpressionFilter;
+  }

Review Comment:
   Can we call these `substrait_projection` and `substrait_filter`? I think we can either leave out the word "expression" or else change it to "susbstrait_extended_expression_X" if we want to be verbose. I'm curious if other folks have thoughts on readability. Substrait will probably be a new concept to many Arrow Java users so I think it would be good have consistent and clear naming here.
   
   If we change the naming, it would be best to change it everywhere e.g. in JNI/C++, too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1313527629


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,167 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +

Review Comment:
   Changed



##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,167 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);

Review Comment:
   Changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1218750453


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +206,132 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testDeserializeExtendedExpressions() {
+    // Extended Expression 01 (`add` `2` to column `id`): id + 2
+    // Extended Expression 02 (`concatenate` column `name` || column `name`): name || name
+    // Extended Expression 03 (`filter` 'id' < 20): id < 20
+    // Extended expression result: [add_two_to_column_a, add(FieldPath(0), 2),
+    // concat_column_a_and_b, binary_join_element_wise(FieldPath(1), FieldPath(1), ""),
+    // filter_one, (FieldPath(0) < 20)]
+    String binaryExtendedExpressions = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwKHggCEhovZnVuY3Rpb25zX2NvbXBhcmlz" +
+        "b24ueWFtbBIRGg8IARoLYWRkOmkzMl9pMzISFBoSCAIQARoMY29uY2F0OnZjaGFyEhIaEAgCEAIaCmx0OmFueV9hbnkaMQoaGhgaBCoCEAE" +
+        "iCBoGEgQKAhIAIgYaBAoCKAIaE2FkZF90d29fdG9fY29sdW1uX2EaOwoiGiAIARoEYgIQASIKGggSBgoEEgIIASIKGggSBgoEEgIIARoVY2" +
+        "9uY2F0X2NvbHVtbl9hX2FuZF9iGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzIwIhoKA" +
+        "klECgROQU1FEg4KBCoCEAEKBGICEAEYAg==";
+    // get binary plan
+    byte[] expression = Base64.getDecoder().decode(binaryExtendedExpressions);
+    ByteBuffer substraitExpression = ByteBuffer.allocateDirect(expression.length);
+    substraitExpression.put(expression);
+    // deserialize extended expression
+    List<String> extededExpressionList =
+        new AceroSubstraitConsumer(rootAllocator()).runDeserializeExpressions(substraitExpression);
+    assertEquals(3, extededExpressionList.size() / 2);
+    assertEquals("add_two_to_column_a", extededExpressionList.get(0));
+    assertEquals("add(FieldPath(0), 2)", extededExpressionList.get(1));
+    assertEquals("concat_column_a_and_b", extededExpressionList.get(2));
+    assertEquals("binary_join_element_wise(FieldPath(1), FieldPath(1), \"\")", extededExpressionList.get(3));
+    assertEquals("filter_id_lower_than_20", extededExpressionList.get(4));
+    assertEquals("(FieldPath(0) < 20)", extededExpressionList.get(5));
+  }

Review Comment:
   Deleted



##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +206,132 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testDeserializeExtendedExpressions() {
+    // Extended Expression 01 (`add` `2` to column `id`): id + 2
+    // Extended Expression 02 (`concatenate` column `name` || column `name`): name || name
+    // Extended Expression 03 (`filter` 'id' < 20): id < 20
+    // Extended expression result: [add_two_to_column_a, add(FieldPath(0), 2),
+    // concat_column_a_and_b, binary_join_element_wise(FieldPath(1), FieldPath(1), ""),
+    // filter_one, (FieldPath(0) < 20)]
+    String binaryExtendedExpressions = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwKHggCEhovZnVuY3Rpb25zX2NvbXBhcmlz" +
+        "b24ueWFtbBIRGg8IARoLYWRkOmkzMl9pMzISFBoSCAIQARoMY29uY2F0OnZjaGFyEhIaEAgCEAIaCmx0OmFueV9hbnkaMQoaGhgaBCoCEAE" +
+        "iCBoGEgQKAhIAIgYaBAoCKAIaE2FkZF90d29fdG9fY29sdW1uX2EaOwoiGiAIARoEYgIQASIKGggSBgoEEgIIASIKGggSBgoEEgIIARoVY2" +
+        "9uY2F0X2NvbHVtbl9hX2FuZF9iGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzIwIhoKA" +
+        "klECgROQU1FEg4KBCoCEAEKBGICEAEYAg==";
+    // get binary plan
+    byte[] expression = Base64.getDecoder().decode(binaryExtendedExpressions);
+    ByteBuffer substraitExpression = ByteBuffer.allocateDirect(expression.length);
+    substraitExpression.put(expression);
+    // deserialize extended expression
+    List<String> extededExpressionList =
+        new AceroSubstraitConsumer(rootAllocator()).runDeserializeExpressions(substraitExpression);
+    assertEquals(3, extededExpressionList.size() / 2);
+    assertEquals("add_two_to_column_a", extededExpressionList.get(0));
+    assertEquals("add(FieldPath(0), 2)", extededExpressionList.get(1));
+    assertEquals("concat_column_a_and_b", extededExpressionList.get(2));
+    assertEquals("binary_join_element_wise(FieldPath(1), FieldPath(1), \"\")", extededExpressionList.get(3));
+    assertEquals("filter_id_lower_than_20", extededExpressionList.get(4));
+    assertEquals("(FieldPath(0) < 20)", extededExpressionList.get(5));
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProjectAndFilter() throws Exception {
+    // Extended Expression 01 (`add` `2` to column `id`): id + 2
+    // Extended Expression 02 (`concatenate` column `name` || column `name`): name || name
+    // Extended Expression 03 (`filter` 'id' < 20): id < 20
+    // Extended expression result: [add_two_to_column_a, add(FieldPath(0), 2),
+    // concat_column_a_and_b, binary_join_element_wise(FieldPath(1), FieldPath(1), ""),
+    // filter_one, (FieldPath(0) < 20)]
+    // Base64.getEncoder().encodeToString(plan.toByteArray()): Generated throughout Substrait POJO Extended Expressions
+    String binaryExtendedExpressions = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwKHggCEhovZnVuY3Rpb25zX2NvbXBhcmlz" +
+        "b24ueWFtbBIRGg8IARoLYWRkOmkzMl9pMzISFBoSCAIQARoMY29uY2F0OnZjaGFyEhIaEAgCEAIaCmx0OmFueV9hbnkaMQoaGhgaBCoCEAE" +
+        "iCBoGEgQKAhIAIgYaBAoCKAIaE2FkZF90d29fdG9fY29sdW1uX2EaOwoiGiAIARoEYgIQASIKGggSBgoEEgIIASIKGggSBgoEEgIIARoVY2" +
+        "9uY2F0X2NvbHVtbl9hX2FuZF9iGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzIwIhoKA" +
+        "klECgROQU1FEg4KBCoCEAEKBGICEAEYAg==";
+    Map<String, String> metadataSchema = new HashMap<>();
+    metadataSchema.put("parquet.avro.schema", "{\"type\":\"record\",\"name\":\"Users\"," +
+        "\"namespace\":\"org.apache.arrow.dataset\",\"fields\":[{\"name\":\"id\"," +
+        "\"type\":[\"int\",\"null\"]},{\"name\":\"name\",\"type\":[\"string\",\"null\"]}]}");
+    metadataSchema.put("writer.model.name", "avro");

Review Comment:
   Compare by Fields



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1218478539


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +206,132 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testDeserializeExtendedExpressions() {
+    // Extended Expression 01 (`add` `2` to column `id`): id + 2
+    // Extended Expression 02 (`concatenate` column `name` || column `name`): name || name
+    // Extended Expression 03 (`filter` 'id' < 20): id < 20
+    // Extended expression result: [add_two_to_column_a, add(FieldPath(0), 2),
+    // concat_column_a_and_b, binary_join_element_wise(FieldPath(1), FieldPath(1), ""),
+    // filter_one, (FieldPath(0) < 20)]
+    String binaryExtendedExpressions = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwKHggCEhovZnVuY3Rpb25zX2NvbXBhcmlz" +
+        "b24ueWFtbBIRGg8IARoLYWRkOmkzMl9pMzISFBoSCAIQARoMY29uY2F0OnZjaGFyEhIaEAgCEAIaCmx0OmFueV9hbnkaMQoaGhgaBCoCEAE" +
+        "iCBoGEgQKAhIAIgYaBAoCKAIaE2FkZF90d29fdG9fY29sdW1uX2EaOwoiGiAIARoEYgIQASIKGggSBgoEEgIIASIKGggSBgoEEgIIARoVY2" +
+        "9uY2F0X2NvbHVtbl9hX2FuZF9iGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzIwIhoKA" +
+        "klECgROQU1FEg4KBCoCEAEKBGICEAEYAg==";
+    // get binary plan
+    byte[] expression = Base64.getDecoder().decode(binaryExtendedExpressions);
+    ByteBuffer substraitExpression = ByteBuffer.allocateDirect(expression.length);
+    substraitExpression.put(expression);
+    // deserialize extended expression
+    List<String> extededExpressionList =
+        new AceroSubstraitConsumer(rootAllocator()).runDeserializeExpressions(substraitExpression);
+    assertEquals(3, extededExpressionList.size() / 2);
+    assertEquals("add_two_to_column_a", extededExpressionList.get(0));
+    assertEquals("add(FieldPath(0), 2)", extededExpressionList.get(1));
+    assertEquals("concat_column_a_and_b", extededExpressionList.get(2));
+    assertEquals("binary_join_element_wise(FieldPath(1), FieldPath(1), \"\")", extededExpressionList.get(3));
+    assertEquals("filter_id_lower_than_20", extededExpressionList.get(4));
+    assertEquals("(FieldPath(0) < 20)", extededExpressionList.get(5));
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProjectAndFilter() throws Exception {
+    // Extended Expression 01 (`add` `2` to column `id`): id + 2
+    // Extended Expression 02 (`concatenate` column `name` || column `name`): name || name
+    // Extended Expression 03 (`filter` 'id' < 20): id < 20
+    // Extended expression result: [add_two_to_column_a, add(FieldPath(0), 2),
+    // concat_column_a_and_b, binary_join_element_wise(FieldPath(1), FieldPath(1), ""),
+    // filter_one, (FieldPath(0) < 20)]
+    // Base64.getEncoder().encodeToString(plan.toByteArray()): Generated throughout Substrait POJO Extended Expressions
+    String binaryExtendedExpressions = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwKHggCEhovZnVuY3Rpb25zX2NvbXBhcmlz" +
+        "b24ueWFtbBIRGg8IARoLYWRkOmkzMl9pMzISFBoSCAIQARoMY29uY2F0OnZjaGFyEhIaEAgCEAIaCmx0OmFueV9hbnkaMQoaGhgaBCoCEAE" +
+        "iCBoGEgQKAhIAIgYaBAoCKAIaE2FkZF90d29fdG9fY29sdW1uX2EaOwoiGiAIARoEYgIQASIKGggSBgoEEgIIASIKGggSBgoEEgIIARoVY2" +
+        "9uY2F0X2NvbHVtbl9hX2FuZF9iGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzIwIhoKA" +
+        "klECgROQU1FEg4KBCoCEAEKBGICEAEYAg==";
+    Map<String, String> metadataSchema = new HashMap<>();
+    metadataSchema.put("parquet.avro.schema", "{\"type\":\"record\",\"name\":\"Users\"," +
+        "\"namespace\":\"org.apache.arrow.dataset\",\"fields\":[{\"name\":\"id\"," +
+        "\"type\":[\"int\",\"null\"]},{\"name\":\"name\",\"type\":[\"string\",\"null\"]}]}");
+    metadataSchema.put("writer.model.name", "avro");

Review Comment:
   Can we just compare without metadata? This seems brittle to include.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1327792921


##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,350 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Dataset also supports projections and filters with Substrait's `Extended Expression`_.
+This requires the substrait-java library.
+
+This Java program:
+
+- Loads a Parquet file containing the "nation" table from the TPC-H benchmark.
+- Projects two new columns:
+    - ``N_NAME || ' - ' || N_COMMENT``
+    - ``N_REGIONKEY + 10``
+- Applies a filter: ``N_NATIONKEY > 18``
+
+.. code-block:: Java
+
+    import io.substrait.extension.ExtensionCollector;
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    public class ClientSubstraitExtendedExpressionsCookbook {
+
+      public static void main(String[] args) throws Exception {
+        // project and filter dataset using extended expression definition - 03 Expressions:
+        // Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3
+        // Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10
+        // Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18
+        projectAndFilterDataset();
+      }
+
+      public static void projectAndFilterDataset() {
+        //String uri = "file:///Users/dsusanibar/data/tpch_parquet/nation.parquet";
+        String uri = "file:////Users/dsusanibar/voltron/fork/consumer-testing/tests/data/tpch_parquet/nation.parquet";

Review Comment:
   Sorry, second time same error



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1306106654


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -58,6 +60,18 @@ public ScanOptions(long batchSize, Optional<String[]> columns) {
     this.columns = columns;
   }
 
+  /**
+   * Constructor.
+   * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+   * @param substraitExtendedExpression Extended expression to evaluate for project new columns or apply filter.
+   */
+  public ScanOptions(long batchSize, ByteBuffer substraitExtendedExpression) {

Review Comment:
   Oh, ok, let me consider the filter again



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on PR #35570:
URL: https://github.com/apache/arrow/pull/35570#issuecomment-1697887564

   Hi @danepitkin could you help me also with a code review?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1318010378


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,167 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);

Review Comment:
   The validation is also useful for ensuring rowcount works correctly. How do you feel about this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1318987826


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,167 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);

Review Comment:
   Looks good! 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zinking commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "zinking (via GitHub)" <gi...@apache.org>.
zinking commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1278744780


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -25,8 +26,9 @@
  * Options used during scanning.
  */
 public class ScanOptions {
-  private final Optional<String[]> columns;
+  private final Optional<String[]> columnsSubset;
   private final long batchSize;
+  private Optional<ByteBuffer> columnsProduceOrFilter;

Review Comment:
   I'd suggest we still expose a simple unbound string in the interface, but leaving the construction (ser/deser) process in the dataset jar itself.
   
   I mean it's good to use subtrait as underlying implementation, but maybe we should just keep the interfaces simple.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on PR #35570:
URL: https://github.com/apache/arrow/pull/35570#issuecomment-1572281182

   > @davisusanibar is it possible to rebase so it's clear which commits are for this PR and which are from the other PRs?
   
   Just merge with main branch and another 02 PRs dependents


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on PR #35570:
URL: https://github.com/apache/arrow/pull/35570#issuecomment-1690411881

   > I guess people are okay with current projection usage and not seeking for subtrait integration for that. for me, I am looking for a method to pass my java filter down to the native scanner. and at this stage only the simplest filter expressions, not the ones with function calls etc ( I guess these can be followed up separately).
   > 
   > in a sense https://github.com/apache/arrow/pull/14287/files satisfies what I wanted but it is active and closed. I'm generally good with using subtrait in the implementation, but I'd suggest let's keep the java interface simple.
   
   Current feature are:
   - Projection
   
   New features added for this PR:
   - Project new column
   - Apply fllters as needed
   
   The main advantage with Substrait is that it offers all the capabilities to define any filter as needed using [extended-expression](https://substrait.io/expressions/extended_expression/).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1304704056


##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -470,12 +471,37 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
       JniGetOrThrow(dataset->NewScan());
   JniAssertOkOrThrow(scanner_builder->Pool(pool));
-  if (columns != nullptr) {
-    std::vector<std::string> column_vector = ToStringVector(env, columns);
+  if (columns_subset != nullptr) {
+    std::vector<std::string> column_vector = ToStringVector(env, columns_subset);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (columns_to_produce_or_filter != nullptr) {
+    auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(columns_to_produce_or_filter));
+    int length = env->GetDirectBufferCapacity(columns_to_produce_or_filter);
+    std::shared_ptr<arrow::Buffer> buffer = JniGetOrThrow(arrow::AllocateBuffer(length));
+    std::memcpy(buffer->mutable_data(), buff, length);
+    arrow::engine::BoundExpressions bounded_expression =
+      JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));

Review Comment:
   Can you give me an example, I didn't catch the idea.



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -470,12 +471,37 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
       JniGetOrThrow(dataset->NewScan());
   JniAssertOkOrThrow(scanner_builder->Pool(pool));
-  if (columns != nullptr) {
-    std::vector<std::string> column_vector = ToStringVector(env, columns);
+  if (columns_subset != nullptr) {
+    std::vector<std::string> column_vector = ToStringVector(env, columns_subset);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (columns_to_produce_or_filter != nullptr) {
+    auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(columns_to_produce_or_filter));
+    int length = env->GetDirectBufferCapacity(columns_to_produce_or_filter);
+    std::shared_ptr<arrow::Buffer> buffer = JniGetOrThrow(arrow::AllocateBuffer(length));
+    std::memcpy(buffer->mutable_data(), buff, length);
+    arrow::engine::BoundExpressions bounded_expression =
+      JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    arrow::compute::Expression filter_expr;
+    int filter_count = 0;
+    for(arrow::engine::NamedExpression named_expression : bounded_expression.named_expressions) {

Review Comment:
   added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1304705238


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -25,8 +26,9 @@
  * Options used during scanning.
  */
 public class ScanOptions {
-  private final Optional<String[]> columns;
+  private final Optional<String[]> columnsSubset;
   private final long batchSize;
+  private Optional<ByteBuffer> columnsProduceOrFilter;

Review Comment:
   deleted



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -49,24 +51,72 @@ public ScanOptions(String[] columns, long batchSize) {
   /**
    * Constructor.
    * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
-   * @param columns (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+   * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
    *                Only columns present in the Array will be scanned.
    */
-  public ScanOptions(long batchSize, Optional<String[]> columns) {
-    Preconditions.checkNotNull(columns);
+  public ScanOptions(long batchSize, Optional<String[]> columnsSubset) {
+    Preconditions.checkNotNull(columnsSubset);
     this.batchSize = batchSize;
-    this.columns = columns;
+    this.columnsSubset = columnsSubset;
+    this.columnsProduceOrFilter = Optional.empty();
   }
 
   public ScanOptions(long batchSize) {
     this(batchSize, Optional.empty());
   }
 
-  public Optional<String[]> getColumns() {
-    return columns;
+  public Optional<String[]> getColumnsSubset() {

Review Comment:
   rollback



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1312200301


##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -458,8 +467,8 @@ JNIEXPORT void JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_closeDataset
  * Signature: (J[Ljava/lang/String;JJ)J
  */
 JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScanner(
-    JNIEnv* env, jobject, jlong dataset_id, jobjectArray columns, jlong batch_size,
-    jlong memory_pool_id) {
+    JNIEnv* env, jobject, jlong dataset_id, jobjectArray columns,
+    jobject substrait_extended_expression, jlong batch_size, jlong memory_pool_id) {

Review Comment:
   I think that should be fine. At this point we have already crossed the JNI boundary so I don't anticipate much performance impact in making the call twice. Ideally, we should keep the implementation flexible enough such that somebody can easily add support for Acero compute expressions as well. e.g. from the user perspective a `filter` can be either a compute expression or a substrait binary blob that gets parsed into a compute expression (wrapped in named/bounded expression objects). We don't need to add that additional Acero functionality in this PR, though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1312200301


##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -458,8 +467,8 @@ JNIEXPORT void JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_closeDataset
  * Signature: (J[Ljava/lang/String;JJ)J
  */
 JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScanner(
-    JNIEnv* env, jobject, jlong dataset_id, jobjectArray columns, jlong batch_size,
-    jlong memory_pool_id) {
+    JNIEnv* env, jobject, jlong dataset_id, jobjectArray columns,
+    jobject substrait_extended_expression, jlong batch_size, jlong memory_pool_id) {

Review Comment:
   I think that should be fine. At this point we have already crossed the JNI boundary so I don't anticipate much performance impact in making the call twice. Ideally, we should keep the implementation flexible enough such that somebody can easily add support for Acero compute expressions as well. e.g. from the user perspective a `filter` can be either a compute expression or a substrait binary blob that gets parsed into a compute expression (wrapped in named/bounded expression objects). We don't need to add that additional functionality in this PR, though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1312994326


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,93 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+

Review Comment:
   Added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm merged pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm merged PR #35570:
URL: https://github.com/apache/arrow/pull/35570


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1319049395


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,173 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitFilter(Optional.of(substraitExpressionFilter))
+        .substraitProjection(Optional.empty())
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+        assertTrue(reader.getVectorSchemaRoot().getVector("id").toString().equals("[19, 1, 11]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("name").toString()
+            .equals("[value_19, value_1, value_11]"));
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilterException() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    // Expression 02: WHERE ID < 10
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW5" +
+        "5X2FueRISGhAIAhACGgpsdDphbnlfYW55GjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl9" +
+        "0aGFuXzIwGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKAoaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzEwIhoKAklECgROQU1F" +
+        "Eg4KBCoCEAEKBGICEAEYAg==";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitFilter(Optional.of(substraitExpressionFilter))
+        .substraitProjection(Optional.empty())
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish()
+    ) {
+      Exception e = assertThrows(RuntimeException.class, () -> dataset.newScan(options));
+      assertTrue(e.getMessage().startsWith("Only one filter expression may be provided"));
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProject() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("add_two_to_column_a", new ArrowType.Int(32, true)),
+        Field.nullable("concat_column_a_and_b", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Project New Column:
+    // Expression ADD: id + 2
+    // Expression CONCAT: name + '-' + name
+    String binarySubstraitExpressionProject = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" +
+        "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" +
+        "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" +
+        "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI=";
+    ByteBuffer substraitExpressionProject = getByteBuffer(binarySubstraitExpressionProject);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitProjection(Optional.of(substraitExpressionProject))
+        .substraitFilter(Optional.empty())
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        assertTrue(reader.getVectorSchemaRoot().getVector("add_two_to_column_a").toString()
+            .equals("[21, 3, 13, 23, 47]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("concat_column_a_and_b").toString()
+            .equals("[value_19 - value_19, value_1 - value_1, value_11 - value_11, " +
+                "value_21 - value_21, value_45 - value_45]"));
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(5, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProjectAndFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("add_two_to_column_a", new ArrowType.Int(32, true)),
+        Field.nullable("concat_column_a_and_b", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Project New Column:
+    // Expression ADD: id + 2
+    // Expression CONCAT: name + '-' + name
+    String binarySubstraitExpressionProject = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" +
+        "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" +
+        "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" +
+        "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI=";
+    ByteBuffer substraitExpressionProject = getByteBuffer(binarySubstraitExpressionProject);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitProjection(Optional.of(substraitExpressionProject))
+        .substraitFilter(Optional.of(substraitExpressionFilter))
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        assertTrue(reader.getVectorSchemaRoot().getVector("add_two_to_column_a").toString()
+            .equals("[21, 3, 13]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("concat_column_a_and_b").toString()
+            .equals("[value_19 - value_19, value_1 - value_1, value_11 - value_11]"));
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  private static ByteBuffer getByteBuffer(String base64EncodedSubstraitFilter) {

Review Comment:
   Thank you, changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1313527809


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,167 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test(expected = RuntimeException.class)
+  public void testBaseParquetReadWithExtendedExpressionsFilterException() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    // Expression 02: WHERE ID < 10
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW5" +
+        "5X2FueRISGhAIAhACGgpsdDphbnlfYW55GjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl9" +
+        "0aGFuXzIwGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKAoaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzEwIhoKAklECgROQU1F" +
+        "Eg4KBCoCEAEKBGICEAEYAg==";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }

Review Comment:
   Deleted



##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,167 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test(expected = RuntimeException.class)
+  public void testBaseParquetReadWithExtendedExpressionsFilterException() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    // Expression 02: WHERE ID < 10
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW5" +
+        "5X2FueRISGhAIAhACGgpsdDphbnlfYW55GjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl9" +
+        "0aGFuXzIwGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKAoaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzEwIhoKAklECgROQU1F" +
+        "Eg4KBCoCEAEKBGICEAEYAg==";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProject() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("add_two_to_column_a", new ArrowType.Int(32, true)),
+        Field.nullable("concat_column_a_and_b", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Project New Column:
+    // Expression ADD: id + 2
+    // Expression CONCAT: name + '-' + name
+    String binarySubstraitExpressionProject = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" +
+        "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" +
+        "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" +
+        "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionProject = Base64.getDecoder().decode(binarySubstraitExpressionProject);
+    ByteBuffer substraitExpressionProject = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionProject.length);
+    substraitExpressionProject.put(arrayByteSubstraitExpressionProject);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+         .substraitExpressionProjection(substraitExpressionProject)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(5, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProjectAndFilter() throws Exception {

Review Comment:
   Added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35570:
URL: https://github.com/apache/arrow/pull/35570#issuecomment-1545707435

   :warning: GitHub issue #34252 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1218750132


##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,335 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Using `Extended Expression`_ we could leverage our current Dataset operations to
+also support Projections and Filters by. To gain access to Projections and Filters
+is needed to define that operations using current Extended Expression Java POJO
+classes defined into `Substrait Java`_ project.
+
+Here is an example of a Java program that queries a Parquet file to project new
+columns and also filter then based on Extended Expression definitions. This example
+show us:
+
+- Load TPCH parquet file Nation.parquet.
+- Produce new Projections and apply Filter into dataset using extended expression definition.
+    - Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3.
+    - Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10.
+    - Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18.
+
+.. code-block:: Java
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import com.google.protobuf.InvalidProtocolBufferException;
+    import com.google.protobuf.util.JsonFormat;
+
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+
+    public class ClientSubstraitExtendedExpressions {
+      public static void main(String[] args) throws Exception {
+        // create extended expression for: project two new columns + one filter
+        String binaryExtendedExpressions = createExtendedExpresionMessageUsingPOJOClasses();
+        // project and filter dataset using extended expression definition - 03 Expressions:
+        // Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3
+        // Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10
+        // Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18
+        projectAndFilterDataset(binaryExtendedExpressions);
+      }
+
+      public static void projectAndFilterDataset(String binaryExtendedExpressions) {
+        String uri = "file:///data/tpch_parquet/nation.parquet";
+        byte[] extendedExpressions = Base64.getDecoder().decode(
+            binaryExtendedExpressions);
+        ByteBuffer substraitExtendedExpressions = ByteBuffer.allocateDirect(
+            extendedExpressions.length);
+        substraitExtendedExpressions.put(extendedExpressions);
+        ScanOptions options = new ScanOptions(/*batchSize*/ 32768,
+            Optional.empty(),
+            Optional.of(substraitExtendedExpressions));
+        try (
+            BufferAllocator allocator = new RootAllocator();
+            DatasetFactory datasetFactory = new FileSystemDatasetFactory(
+                allocator, NativeMemoryPool.getDefault(),
+                FileFormat.PARQUET, uri);
+            Dataset dataset = datasetFactory.finish();
+            Scanner scanner = dataset.newScan(options);
+            ArrowReader reader = scanner.scanBatches()
+        ) {
+          while (reader.loadNextBatch()) {
+            System.out.println(
+                reader.getVectorSchemaRoot().contentToTSVString());
+          }
+        } catch (Exception e) {
+          e.printStackTrace();
+        }
+      }
+
+      private static String createExtendedExpresionMessageUsingPOJOClasses() throws InvalidProtocolBufferException {
+        // Expression: N_REGIONKEY + 10 = col 3 + 10
+        Expression.Builder selectionBuilderProjectOne = Expression.newBuilder().
+            setSelection(
+                Expression.FieldReference.newBuilder().
+                    setDirectReference(
+                        Expression.ReferenceSegment.newBuilder().
+                            setStructField(
+                                Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                    2)
+                            )
+                    )
+            );
+        Expression.Builder literalBuilderProjectOne = Expression.newBuilder()
+            .setLiteral(
+                Expression.Literal.newBuilder().setI32(10)
+            );
+        io.substrait.proto.Type outputProjectOne = TypeCreator.NULLABLE.I32.accept(
+            new TypeProtoConverter());
+        Expression.Builder expressionBuilderProjectOne = Expression.
+            newBuilder().
+            setScalarFunction(
+                Expression.
+                    ScalarFunction.
+                    newBuilder().
+                    setFunctionReference(0).
+                    setOutputType(outputProjectOne).
+                    addArguments(
+                        0,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderProjectOne)
+                    ).
+                    addArguments(
+                        1,
+                        FunctionArgument.newBuilder().setValue(
+                            literalBuilderProjectOne)
+                    )
+            );
+        ExpressionReference.Builder expressionReferenceBuilderProjectOne = ExpressionReference.newBuilder().
+            setExpression(expressionBuilderProjectOne)
+            .addOutputNames("ADD_TEN_TO_COLUMN_N_REGIONKEY");
+
+        // Expression: name || name = N_NAME || "-" || N_COMMENT = col 1 || col 3
+        Expression.Builder selectionBuilderProjectTwo = Expression.newBuilder().
+            setSelection(
+                Expression.FieldReference.newBuilder().
+                    setDirectReference(
+                        Expression.ReferenceSegment.newBuilder().
+                            setStructField(
+                                Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                    1)
+                            )
+                    )
+            );
+        Expression.Builder selectionBuilderProjectTwoConcatLiteral = Expression.newBuilder()
+            .setLiteral(
+                Expression.Literal.newBuilder().setString(" - ")
+            );
+        Expression.Builder selectionBuilderProjectOneToConcat = Expression.newBuilder().
+            setSelection(
+                Expression.FieldReference.newBuilder().
+                    setDirectReference(
+                        Expression.ReferenceSegment.newBuilder().
+                            setStructField(
+                                Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                    3)
+                            )
+                    )
+            );
+        io.substrait.proto.Type outputProjectTwo = TypeCreator.NULLABLE.STRING.accept(
+            new TypeProtoConverter());
+        Expression.Builder expressionBuilderProjectTwo = Expression.
+            newBuilder().
+            setScalarFunction(
+                Expression.
+                    ScalarFunction.
+                    newBuilder().
+                    setFunctionReference(1).
+                    setOutputType(outputProjectTwo).
+                    addArguments(
+                        0,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderProjectTwo)
+                    ).
+                    addArguments(
+                        1,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderProjectTwoConcatLiteral)
+                    ).
+                    addArguments(
+                        2,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderProjectOneToConcat)
+                    )
+            );
+        ExpressionReference.Builder expressionReferenceBuilderProjectTwo = ExpressionReference.newBuilder().
+            setExpression(expressionBuilderProjectTwo)
+            .addOutputNames("CONCAT_COLUMNS_N_NAME_AND_N_COMMENT");
+
+        // Expression: Filter: N_NATIONKEY > 18 = col 1 > 18
+        Expression.Builder selectionBuilderFilterOne = Expression.newBuilder().
+            setSelection(
+                Expression.FieldReference.newBuilder().
+                    setDirectReference(
+                        Expression.ReferenceSegment.newBuilder().
+                            setStructField(
+                                Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                    0)
+                            )
+                    )
+            );
+        Expression.Builder literalBuilderFilterOne = Expression.newBuilder()
+            .setLiteral(
+                Expression.Literal.newBuilder().setI32(18)
+            );
+        io.substrait.proto.Type outputFilterOne = TypeCreator.NULLABLE.BOOLEAN.accept(
+            new TypeProtoConverter());
+        Expression.Builder expressionBuilderFilterOne = Expression.
+            newBuilder().
+            setScalarFunction(
+                Expression.
+                    ScalarFunction.
+                    newBuilder().
+                    setFunctionReference(2).
+                    setOutputType(outputFilterOne).
+                    addArguments(
+                        0,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderFilterOne)
+                    ).
+                    addArguments(
+                        1,
+                        FunctionArgument.newBuilder().setValue(
+                            literalBuilderFilterOne)
+                    )
+            );
+        ExpressionReference.Builder expressionReferenceBuilderFilterOne = ExpressionReference.newBuilder().
+            setExpression(expressionBuilderFilterOne)
+            .addOutputNames("COLUMN_N_NATIONKEY_GREATER_THAN_18");
+
+        List<String> columnNames = Arrays.asList("N_NATIONKEY", "N_NAME",
+            "N_REGIONKEY", "N_COMMENT");
+        List<Type> dataTypes = Arrays.asList(
+            TypeCreator.NULLABLE.I32,
+            TypeCreator.NULLABLE.STRING,
+            TypeCreator.NULLABLE.I32,
+            TypeCreator.NULLABLE.STRING
+        );
+        //
+        NamedStruct of = NamedStruct.of(
+            columnNames,
+            Type.Struct.builder().fields(dataTypes).nullable(false).build()
+        );
+
+        // Extensions URI
+        HashMap<String, SimpleExtensionURI> extensionUris = new HashMap<>();
+        extensionUris.put(
+            "key-001",
+            SimpleExtensionURI.newBuilder()
+                .setExtensionUriAnchor(1)
+                .setUri("/functions_arithmetic.yaml")
+                .build()
+        );
+        extensionUris.put(
+            "key-002",
+            SimpleExtensionURI.newBuilder()
+                .setExtensionUriAnchor(2)
+                .setUri("/functions_comparison.yaml")
+                .build()
+        );
+
+        // Extensions
+        ArrayList<SimpleExtensionDeclaration> extensions = new ArrayList<>();
+        SimpleExtensionDeclaration extensionFunctionAdd = SimpleExtensionDeclaration.newBuilder()
+            .setExtensionFunction(
+                SimpleExtensionDeclaration.ExtensionFunction.newBuilder()
+                    .setFunctionAnchor(0)
+                    .setName("add:i32_i32")
+                    .setExtensionUriReference(1))
+            .build();
+        SimpleExtensionDeclaration extensionFunctionGreaterThan = SimpleExtensionDeclaration.newBuilder()
+            .setExtensionFunction(
+                SimpleExtensionDeclaration.ExtensionFunction.newBuilder()
+                    .setFunctionAnchor(1)
+                    .setName("concat:vchar")
+                    .setExtensionUriReference(2))
+            .build();
+        SimpleExtensionDeclaration extensionFunctionLowerThan = SimpleExtensionDeclaration.newBuilder()
+            .setExtensionFunction(
+                SimpleExtensionDeclaration.ExtensionFunction.newBuilder()
+                    .setFunctionAnchor(2)
+                    .setName("gt:any_any")
+                    .setExtensionUriReference(2))
+            .build();
+        extensions.add(extensionFunctionAdd);
+        extensions.add(extensionFunctionGreaterThan);
+        extensions.add(extensionFunctionLowerThan);
+
+        // Extended Expression
+        ExtendedExpression.Builder extendedExpressionBuilder =
+            ExtendedExpression.newBuilder().
+                addReferredExpr(0,
+                    expressionReferenceBuilderProjectOne).
+                addReferredExpr(1,
+                    expressionReferenceBuilderProjectTwo).
+                addReferredExpr(2,
+                    expressionReferenceBuilderFilterOne).
+                setBaseSchema(of.toProto());
+        extendedExpressionBuilder.addAllExtensionUris(extensionUris.values());
+        extendedExpressionBuilder.addAllExtensions(extensions);
+
+        ExtendedExpression extendedExpression = extendedExpressionBuilder.build();
+
+        // Print JSON
+        System.out.println(
+            JsonFormat.printer().includingDefaultValueFields().print(
+                extendedExpression));
+        // Print binary representation
+        System.out.println(Base64.getEncoder().encodeToString(
+            extendedExpression.toByteArray()));

Review Comment:
   Deleted



##########
java/dataset/src/main/java/org/apache/arrow/dataset/substrait/JniWrapper.java:
##########
@@ -69,5 +69,8 @@ public native void executeSerializedPlan(String planInput, String[] mapTableToMe
    * @param memoryAddressOutput the memory address where RecordBatchReader is exported.
    */
   public native void executeSerializedPlan(ByteBuffer planInput, String[] mapTableToMemoryAddressInput,
-                                                      long memoryAddressOutput);
+                                           long memoryAddressOutput);
+
+  // add description

Review Comment:
   Deleted



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1218750026


##########
docs/source/java/dataset.rst:
##########
@@ -158,6 +156,21 @@ Or use shortcut construtor:
 
 Then all columns will be emitted during scanning.
 
+Projection (Produce New Columns) and Filters
+============================================
+
+User can specify projections (new columns) or filters in ScanOptions. For example:
+
+.. code-block:: Java
+
+   ByteBuffer substraitExtendedExpressions = ...; // createExtendedExpresionMessageUsingSubstraitPOJOClasses

Review Comment:
   Added comment on a new line



##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,335 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Using `Extended Expression`_ we could leverage our current Dataset operations to
+also support Projections and Filters by. To gain access to Projections and Filters
+is needed to define that operations using current Extended Expression Java POJO
+classes defined into `Substrait Java`_ project.
+
+Here is an example of a Java program that queries a Parquet file to project new
+columns and also filter then based on Extended Expression definitions. This example
+show us:
+
+- Load TPCH parquet file Nation.parquet.
+- Produce new Projections and apply Filter into dataset using extended expression definition.
+    - Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3.
+    - Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10.
+    - Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18.
+
+.. code-block:: Java
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import com.google.protobuf.InvalidProtocolBufferException;
+    import com.google.protobuf.util.JsonFormat;
+
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+
+    public class ClientSubstraitExtendedExpressions {
+      public static void main(String[] args) throws Exception {
+        // create extended expression for: project two new columns + one filter
+        String binaryExtendedExpressions = createExtendedExpresionMessageUsingPOJOClasses();

Review Comment:
   Changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1307724904


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +83,8 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public ByteBuffer getSubstraitExtendedExpression() {

Review Comment:
   There are: projection, filter and projectionAndFilter, that could by used at ScanOptions builder time, then this method is a helper needed for evaluate what of these option was used.
   
   Could you help me to clarify what is your comment when you mention take another look on what is your recommendation please?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1312995725


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -27,6 +28,9 @@
 public class ScanOptions {
   private final Optional<String[]> columns;
   private final long batchSize;
+  private ByteBuffer projection;
+  private ByteBuffer filter;

Review Comment:
   Changed



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +73,106 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  private ByteBuffer getProjection() {

Review Comment:
   Changed



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -458,8 +467,8 @@ JNIEXPORT void JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_closeDataset
  * Signature: (J[Ljava/lang/String;JJ)J
  */
 JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScanner(
-    JNIEnv* env, jobject, jlong dataset_id, jobjectArray columns, jlong batch_size,
-    jlong memory_pool_id) {
+    JNIEnv* env, jobject, jlong dataset_id, jobjectArray columns,
+    jobject substrait_extended_expression, jlong batch_size, jlong memory_pool_id) {

Review Comment:
   Changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1310926435


##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -458,8 +467,8 @@ JNIEXPORT void JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_closeDataset
  * Signature: (J[Ljava/lang/String;JJ)J
  */
 JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScanner(
-    JNIEnv* env, jobject, jlong dataset_id, jobjectArray columns, jlong batch_size,
-    jlong memory_pool_id) {
+    JNIEnv* env, jobject, jlong dataset_id, jobjectArray columns,
+    jobject substrait_extended_expression, jlong batch_size, jlong memory_pool_id) {

Review Comment:
   Instead of passing `substrait_extended_expression`, can we pass in `filter` and `projection` parameters?



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +73,106 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  private ByteBuffer getProjection() {

Review Comment:
   `getProjection` and `getFilter` should return optional values similar to `getColumns` if we remove `getSubstraitExtendedExpression` (see below).



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -27,6 +28,9 @@
 public class ScanOptions {
   private final Optional<String[]> columns;
   private final long batchSize;
+  private ByteBuffer projection;
+  private ByteBuffer filter;

Review Comment:
   Should projection/filter be `final` if we have a builder for this object? We want the object to be immutable after creation I think.



##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,93 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+

Review Comment:
   IMO it would be nice to see separate tests for `filter` and `projection` functionality. 



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +83,8 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public ByteBuffer getSubstraitExtendedExpression() {

Review Comment:
   Hey @davisusanibar , I think we should remove `getProjectionAndFilter` and `getSubstraitExtendedExpression`. If the user wants to set both, they can set filter and projection separately.



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +73,106 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  private ByteBuffer getProjection() {
+    return projection;
+  }
+
+  private ByteBuffer getFilter() {
+    return filter;
+  }
+
+  private ByteBuffer getProjectionAndFilter() {
+    return projectionAndFilter;
+  }
+
+  /**
+   * To evaluate what option was used to define Substrait Extended Expression (Project/Filter).
+   *
+   * @return Substrait Extended Expression configured for project new columns and/or apply filter
+   */
+  public ByteBuffer getSubstraitExtendedExpression() {
+    if (getProjection() != null) {
+      return getProjection();
+    } else if (getFilter() != null) {
+      return getFilter();
+    } else if (getProjectionAndFilter() != null) {
+      return getProjectionAndFilter();
+    } else {
+      return null;
+    }
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private final Optional<String[]> columns;
+    private ByteBuffer projection;
+    private ByteBuffer filter;
+    private ByteBuffer projectionAndFilter;
+
+    /**
+     * Constructor.
+     * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+     * @param columns (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+     *                Only columns present in the Array will be scanned.
+     */
+    public Builder(long batchSize, Optional<String[]> columns) {

Review Comment:
   Should a Builder API only enforce mandatory args in its constructor (e.g. `batchSize`)? `columns` is optional and can have its own builder method.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1303404292


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -49,24 +51,72 @@ public ScanOptions(String[] columns, long batchSize) {
   /**
    * Constructor.
    * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
-   * @param columns (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+   * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
    *                Only columns present in the Array will be scanned.
    */
-  public ScanOptions(long batchSize, Optional<String[]> columns) {
-    Preconditions.checkNotNull(columns);
+  public ScanOptions(long batchSize, Optional<String[]> columnsSubset) {
+    Preconditions.checkNotNull(columnsSubset);
     this.batchSize = batchSize;
-    this.columns = columns;
+    this.columnsSubset = columnsSubset;
+    this.columnsProduceOrFilter = Optional.empty();
   }
 
   public ScanOptions(long batchSize) {
     this(batchSize, Optional.empty());
   }
 
-  public Optional<String[]> getColumns() {
-    return columns;
+  public Optional<String[]> getColumnsSubset() {
+    return columnsSubset;
   }
 
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getColumnsProduceOrFilter() {
+    return columnsProduceOrFilter;
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private final Optional<String[]> columnsSubset;
+    private Optional<ByteBuffer> columnsProduceOrFilter = Optional.empty();
+
+    /**
+     * Constructor.
+     * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+     * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+     *                Only columns present in the Array will be scanned.
+     */
+    public Builder(long batchSize, Optional<String[]> columnsSubset) {
+      Preconditions.checkNotNull(columnsSubset);
+      this.batchSize = batchSize;
+      this.columnsSubset = columnsSubset;
+    }
+
+    /**
+     * Define binary extended expression message for projects new columns or applies filter.

Review Comment:
   ```suggestion
        * Set the Substrait extended expression.
        *
        * <p>Can be used to filter data and/or project new columns.
   ```



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -49,24 +51,72 @@ public ScanOptions(String[] columns, long batchSize) {
   /**
    * Constructor.
    * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
-   * @param columns (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+   * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
    *                Only columns present in the Array will be scanned.
    */
-  public ScanOptions(long batchSize, Optional<String[]> columns) {
-    Preconditions.checkNotNull(columns);
+  public ScanOptions(long batchSize, Optional<String[]> columnsSubset) {
+    Preconditions.checkNotNull(columnsSubset);
     this.batchSize = batchSize;
-    this.columns = columns;
+    this.columnsSubset = columnsSubset;
+    this.columnsProduceOrFilter = Optional.empty();
   }
 
   public ScanOptions(long batchSize) {
     this(batchSize, Optional.empty());
   }
 
-  public Optional<String[]> getColumns() {
-    return columns;
+  public Optional<String[]> getColumnsSubset() {
+    return columnsSubset;
   }
 
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getColumnsProduceOrFilter() {
+    return columnsProduceOrFilter;
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private final Optional<String[]> columnsSubset;
+    private Optional<ByteBuffer> columnsProduceOrFilter = Optional.empty();
+
+    /**
+     * Constructor.
+     * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+     * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+     *                Only columns present in the Array will be scanned.
+     */
+    public Builder(long batchSize, Optional<String[]> columnsSubset) {

Review Comment:
   I'd say for a builder's constructor, there's no need for arguments; just use the builder methods.



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -470,12 +471,37 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
       JniGetOrThrow(dataset->NewScan());
   JniAssertOkOrThrow(scanner_builder->Pool(pool));
-  if (columns != nullptr) {
-    std::vector<std::string> column_vector = ToStringVector(env, columns);
+  if (columns_subset != nullptr) {
+    std::vector<std::string> column_vector = ToStringVector(env, columns_subset);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (columns_to_produce_or_filter != nullptr) {
+    auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(columns_to_produce_or_filter));
+    int length = env->GetDirectBufferCapacity(columns_to_produce_or_filter);
+    std::shared_ptr<arrow::Buffer> buffer = JniGetOrThrow(arrow::AllocateBuffer(length));
+    std::memcpy(buffer->mutable_data(), buff, length);
+    arrow::engine::BoundExpressions bounded_expression =
+      JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    arrow::compute::Expression filter_expr;
+    int filter_count = 0;
+    for(arrow::engine::NamedExpression named_expression : bounded_expression.named_expressions) {
+      if (named_expression.expression.type()->id() == arrow::Type::BOOL) {
+        if (filter_count > 0) {
+          JniThrow("The process only support one filter expression declared");
+        }
+        filter_expr = named_expression.expression;
+        filter_count++;

Review Comment:
   You can track this instead with `optional<Expression> filter_expr`



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -470,12 +471,37 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
       JniGetOrThrow(dataset->NewScan());
   JniAssertOkOrThrow(scanner_builder->Pool(pool));
-  if (columns != nullptr) {
-    std::vector<std::string> column_vector = ToStringVector(env, columns);
+  if (columns_subset != nullptr) {
+    std::vector<std::string> column_vector = ToStringVector(env, columns_subset);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (columns_to_produce_or_filter != nullptr) {
+    auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(columns_to_produce_or_filter));
+    int length = env->GetDirectBufferCapacity(columns_to_produce_or_filter);
+    std::shared_ptr<arrow::Buffer> buffer = JniGetOrThrow(arrow::AllocateBuffer(length));
+    std::memcpy(buffer->mutable_data(), buff, length);
+    arrow::engine::BoundExpressions bounded_expression =
+      JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    arrow::compute::Expression filter_expr;
+    int filter_count = 0;
+    for(arrow::engine::NamedExpression named_expression : bounded_expression.named_expressions) {
+      if (named_expression.expression.type()->id() == arrow::Type::BOOL) {
+        if (filter_count > 0) {
+          JniThrow("The process only support one filter expression declared");

Review Comment:
   ```suggestion
             JniThrow("Only one filter expression may be provided");
   ```



##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,323 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Dataset also supports projections and filters with Substrait's extended expressions.

Review Comment:
   Link to what this means



##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,323 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Dataset also supports projections and filters with Substrait's extended expressions.
+This requires the substrait-java library.
+
+This Java program:
+
+- Loads a Parquet file containing the "nation" table from the TPC-H benchmark.
+- Projects two new columns:
+    - ``N_NAME || ' - ' || N_COMMENT``
+    - ``N_REGIONKEY + 10``
+- Applies a filter: ``N_NATIONKEY > 18``
+
+.. code-block:: Java
+
+    import com.google.protobuf.InvalidProtocolBufferException;
+    import io.substrait.extension.ExtensionCollector;
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    public class ClientSubstraitExtendedExpressions {
+        public static void main(String[] args) throws Exception {
+            // create extended expression for: project two new columns + one filter
+            ByteBuffer binaryExtendedExpressions = createExtendedExpresionMessageUsingPOJOClasses();
+            // project and filter dataset using extended expression definition - 03 Expressions:
+            // Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3
+            // Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10
+            // Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18
+            projectAndFilterDataset(binaryExtendedExpressions);
+        }
+
+        public static void projectAndFilterDataset(ByteBuffer binaryExtendedExpressions) {
+            String uri = "file:////Users/dsusanibar/voltron/fork/consumer-testing/tests/data/tpch_parquet/nation.parquet";

Review Comment:
   Try not to put our company name in strings?



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -49,24 +51,72 @@ public ScanOptions(String[] columns, long batchSize) {
   /**
    * Constructor.
    * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
-   * @param columns (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+   * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
    *                Only columns present in the Array will be scanned.
    */
-  public ScanOptions(long batchSize, Optional<String[]> columns) {
-    Preconditions.checkNotNull(columns);
+  public ScanOptions(long batchSize, Optional<String[]> columnsSubset) {
+    Preconditions.checkNotNull(columnsSubset);
     this.batchSize = batchSize;
-    this.columns = columns;
+    this.columnsSubset = columnsSubset;
+    this.columnsProduceOrFilter = Optional.empty();
   }
 
   public ScanOptions(long batchSize) {
     this(batchSize, Optional.empty());
   }
 
-  public Optional<String[]> getColumns() {
-    return columns;
+  public Optional<String[]> getColumnsSubset() {
+    return columnsSubset;
   }
 
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getColumnsProduceOrFilter() {
+    return columnsProduceOrFilter;
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private final Optional<String[]> columnsSubset;
+    private Optional<ByteBuffer> columnsProduceOrFilter = Optional.empty();
+
+    /**
+     * Constructor.
+     * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+     * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+     *                Only columns present in the Array will be scanned.
+     */
+    public Builder(long batchSize, Optional<String[]> columnsSubset) {
+      Preconditions.checkNotNull(columnsSubset);
+      this.batchSize = batchSize;
+      this.columnsSubset = columnsSubset;
+    }
+
+    /**
+     * Define binary extended expression message for projects new columns or applies filter.
+     *
+     * @param columnsProduceOrFilter (Optional) Expressions to evaluate to projects new columns or applies filter.
+     * @return the ScanOptions configured.
+     */
+    public Builder columnsProduceOrFilter(Optional<ByteBuffer> columnsProduceOrFilter) {

Review Comment:
   Please revise the rest of the code based on this.



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -25,8 +26,9 @@
  * Options used during scanning.
  */
 public class ScanOptions {
-  private final Optional<String[]> columns;
+  private final Optional<String[]> columnsSubset;
   private final long batchSize;
+  private Optional<ByteBuffer> columnsProduceOrFilter;

Review Comment:
   Why is this not `final` like the others?



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -470,12 +471,37 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
       JniGetOrThrow(dataset->NewScan());
   JniAssertOkOrThrow(scanner_builder->Pool(pool));
-  if (columns != nullptr) {
-    std::vector<std::string> column_vector = ToStringVector(env, columns);
+  if (columns_subset != nullptr) {
+    std::vector<std::string> column_vector = ToStringVector(env, columns_subset);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (columns_to_produce_or_filter != nullptr) {
+    auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(columns_to_produce_or_filter));
+    int length = env->GetDirectBufferCapacity(columns_to_produce_or_filter);
+    std::shared_ptr<arrow::Buffer> buffer = JniGetOrThrow(arrow::AllocateBuffer(length));
+    std::memcpy(buffer->mutable_data(), buff, length);
+    arrow::engine::BoundExpressions bounded_expression =
+      JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    arrow::compute::Expression filter_expr;
+    int filter_count = 0;
+    for(arrow::engine::NamedExpression named_expression : bounded_expression.named_expressions) {

Review Comment:
   const&, or at least use & and move below



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -470,12 +471,37 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
       JniGetOrThrow(dataset->NewScan());
   JniAssertOkOrThrow(scanner_builder->Pool(pool));
-  if (columns != nullptr) {
-    std::vector<std::string> column_vector = ToStringVector(env, columns);
+  if (columns_subset != nullptr) {
+    std::vector<std::string> column_vector = ToStringVector(env, columns_subset);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (columns_to_produce_or_filter != nullptr) {
+    auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(columns_to_produce_or_filter));

Review Comment:
   const?



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -49,24 +51,72 @@ public ScanOptions(String[] columns, long batchSize) {
   /**
    * Constructor.
    * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
-   * @param columns (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+   * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
    *                Only columns present in the Array will be scanned.
    */
-  public ScanOptions(long batchSize, Optional<String[]> columns) {
-    Preconditions.checkNotNull(columns);
+  public ScanOptions(long batchSize, Optional<String[]> columnsSubset) {
+    Preconditions.checkNotNull(columnsSubset);
     this.batchSize = batchSize;
-    this.columns = columns;
+    this.columnsSubset = columnsSubset;
+    this.columnsProduceOrFilter = Optional.empty();
   }
 
   public ScanOptions(long batchSize) {
     this(batchSize, Optional.empty());
   }
 
-  public Optional<String[]> getColumns() {
-    return columns;
+  public Optional<String[]> getColumnsSubset() {
+    return columnsSubset;
   }
 
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getColumnsProduceOrFilter() {
+    return columnsProduceOrFilter;
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private final Optional<String[]> columnsSubset;
+    private Optional<ByteBuffer> columnsProduceOrFilter = Optional.empty();
+
+    /**
+     * Constructor.
+     * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+     * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+     *                Only columns present in the Array will be scanned.
+     */
+    public Builder(long batchSize, Optional<String[]> columnsSubset) {
+      Preconditions.checkNotNull(columnsSubset);
+      this.batchSize = batchSize;
+      this.columnsSubset = columnsSubset;
+    }
+
+    /**
+     * Define binary extended expression message for projects new columns or applies filter.
+     *
+     * @param columnsProduceOrFilter (Optional) Expressions to evaluate to projects new columns or applies filter.
+     * @return the ScanOptions configured.
+     */
+    public Builder columnsProduceOrFilter(Optional<ByteBuffer> columnsProduceOrFilter) {

Review Comment:
   This needs to be named something clearer. `substraitExtendedExpression`?



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -49,24 +51,72 @@ public ScanOptions(String[] columns, long batchSize) {
   /**
    * Constructor.
    * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
-   * @param columns (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+   * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
    *                Only columns present in the Array will be scanned.
    */
-  public ScanOptions(long batchSize, Optional<String[]> columns) {
-    Preconditions.checkNotNull(columns);
+  public ScanOptions(long batchSize, Optional<String[]> columnsSubset) {
+    Preconditions.checkNotNull(columnsSubset);
     this.batchSize = batchSize;
-    this.columns = columns;
+    this.columnsSubset = columnsSubset;
+    this.columnsProduceOrFilter = Optional.empty();
   }
 
   public ScanOptions(long batchSize) {
     this(batchSize, Optional.empty());
   }
 
-  public Optional<String[]> getColumns() {
-    return columns;
+  public Optional<String[]> getColumnsSubset() {

Review Comment:
   I don't think there's a compelling reason to rename everything. I do think it is time to add proper docstrings.



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -470,12 +471,37 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
       JniGetOrThrow(dataset->NewScan());
   JniAssertOkOrThrow(scanner_builder->Pool(pool));
-  if (columns != nullptr) {
-    std::vector<std::string> column_vector = ToStringVector(env, columns);
+  if (columns_subset != nullptr) {
+    std::vector<std::string> column_vector = ToStringVector(env, columns_subset);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (columns_to_produce_or_filter != nullptr) {
+    auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(columns_to_produce_or_filter));
+    int length = env->GetDirectBufferCapacity(columns_to_produce_or_filter);
+    std::shared_ptr<arrow::Buffer> buffer = JniGetOrThrow(arrow::AllocateBuffer(length));
+    std::memcpy(buffer->mutable_data(), buff, length);
+    arrow::engine::BoundExpressions bounded_expression =
+      JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    arrow::compute::Expression filter_expr;
+    int filter_count = 0;
+    for(arrow::engine::NamedExpression named_expression : bounded_expression.named_expressions) {
+      if (named_expression.expression.type()->id() == arrow::Type::BOOL) {
+        if (filter_count > 0) {
+          JniThrow("The process only support one filter expression declared");
+        }
+        filter_expr = named_expression.expression;
+        filter_count++;
+      } else {
+        project_exprs.push_back(named_expression.expression);
+        project_names.push_back(named_expression.name);
+      }
+    }
+    JniAssertOkOrThrow(scanner_builder->Project(project_exprs, project_names));

Review Comment:
   move



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -49,24 +51,72 @@ public ScanOptions(String[] columns, long batchSize) {
   /**
    * Constructor.
    * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
-   * @param columns (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+   * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
    *                Only columns present in the Array will be scanned.
    */
-  public ScanOptions(long batchSize, Optional<String[]> columns) {
-    Preconditions.checkNotNull(columns);
+  public ScanOptions(long batchSize, Optional<String[]> columnsSubset) {
+    Preconditions.checkNotNull(columnsSubset);
     this.batchSize = batchSize;
-    this.columns = columns;
+    this.columnsSubset = columnsSubset;
+    this.columnsProduceOrFilter = Optional.empty();
   }
 
   public ScanOptions(long batchSize) {
     this(batchSize, Optional.empty());
   }
 
-  public Optional<String[]> getColumns() {
-    return columns;
+  public Optional<String[]> getColumnsSubset() {
+    return columnsSubset;
   }
 
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getColumnsProduceOrFilter() {
+    return columnsProduceOrFilter;
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private final Optional<String[]> columnsSubset;
+    private Optional<ByteBuffer> columnsProduceOrFilter = Optional.empty();
+
+    /**
+     * Constructor.
+     * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+     * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+     *                Only columns present in the Array will be scanned.
+     */
+    public Builder(long batchSize, Optional<String[]> columnsSubset) {
+      Preconditions.checkNotNull(columnsSubset);
+      this.batchSize = batchSize;
+      this.columnsSubset = columnsSubset;
+    }
+
+    /**
+     * Define binary extended expression message for projects new columns or applies filter.
+     *
+     * @param columnsProduceOrFilter (Optional) Expressions to evaluate to projects new columns or applies filter.
+     * @return the ScanOptions configured.
+     */
+    public Builder columnsProduceOrFilter(Optional<ByteBuffer> columnsProduceOrFilter) {

Review Comment:
   Don't take Optional here.



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -470,12 +471,37 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
       JniGetOrThrow(dataset->NewScan());
   JniAssertOkOrThrow(scanner_builder->Pool(pool));
-  if (columns != nullptr) {
-    std::vector<std::string> column_vector = ToStringVector(env, columns);
+  if (columns_subset != nullptr) {
+    std::vector<std::string> column_vector = ToStringVector(env, columns_subset);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (columns_to_produce_or_filter != nullptr) {
+    auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(columns_to_produce_or_filter));
+    int length = env->GetDirectBufferCapacity(columns_to_produce_or_filter);
+    std::shared_ptr<arrow::Buffer> buffer = JniGetOrThrow(arrow::AllocateBuffer(length));
+    std::memcpy(buffer->mutable_data(), buff, length);
+    arrow::engine::BoundExpressions bounded_expression =
+      JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));

Review Comment:
   That said, you don't necessarily need to copy to a new buffer here if you want to avoid that; you can directly wrap a pointer + length in a buffer (so long as the buffer does not escape)



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -470,12 +471,37 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
       JniGetOrThrow(dataset->NewScan());
   JniAssertOkOrThrow(scanner_builder->Pool(pool));
-  if (columns != nullptr) {
-    std::vector<std::string> column_vector = ToStringVector(env, columns);
+  if (columns_subset != nullptr) {
+    std::vector<std::string> column_vector = ToStringVector(env, columns_subset);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (columns_to_produce_or_filter != nullptr) {
+    auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(columns_to_produce_or_filter));
+    int length = env->GetDirectBufferCapacity(columns_to_produce_or_filter);
+    std::shared_ptr<arrow::Buffer> buffer = JniGetOrThrow(arrow::AllocateBuffer(length));
+    std::memcpy(buffer->mutable_data(), buff, length);

Review Comment:
   I think we do this a few times by now, factor out a helper



##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,323 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Dataset also supports projections and filters with Substrait's extended expressions.
+This requires the substrait-java library.
+
+This Java program:
+
+- Loads a Parquet file containing the "nation" table from the TPC-H benchmark.
+- Projects two new columns:
+    - ``N_NAME || ' - ' || N_COMMENT``
+    - ``N_REGIONKEY + 10``
+- Applies a filter: ``N_NATIONKEY > 18``
+
+.. code-block:: Java
+
+    import com.google.protobuf.InvalidProtocolBufferException;
+    import io.substrait.extension.ExtensionCollector;
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    public class ClientSubstraitExtendedExpressions {
+        public static void main(String[] args) throws Exception {
+            // create extended expression for: project two new columns + one filter
+            ByteBuffer binaryExtendedExpressions = createExtendedExpresionMessageUsingPOJOClasses();

Review Comment:
   There is no reason to be so verbose



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -470,12 +471,37 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
       JniGetOrThrow(dataset->NewScan());
   JniAssertOkOrThrow(scanner_builder->Pool(pool));
-  if (columns != nullptr) {
-    std::vector<std::string> column_vector = ToStringVector(env, columns);
+  if (columns_subset != nullptr) {
+    std::vector<std::string> column_vector = ToStringVector(env, columns_subset);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (columns_to_produce_or_filter != nullptr) {
+    auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(columns_to_produce_or_filter));
+    int length = env->GetDirectBufferCapacity(columns_to_produce_or_filter);
+    std::shared_ptr<arrow::Buffer> buffer = JniGetOrThrow(arrow::AllocateBuffer(length));
+    std::memcpy(buffer->mutable_data(), buff, length);
+    arrow::engine::BoundExpressions bounded_expression =
+      JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    arrow::compute::Expression filter_expr;
+    int filter_count = 0;
+    for(arrow::engine::NamedExpression named_expression : bounded_expression.named_expressions) {
+      if (named_expression.expression.type()->id() == arrow::Type::BOOL) {
+        if (filter_count > 0) {
+          JniThrow("The process only support one filter expression declared");
+        }
+        filter_expr = named_expression.expression;
+        filter_count++;

Review Comment:
   That said, we should design the API to separate project and filter expressions in the first place.



##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,323 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Dataset also supports projections and filters with Substrait's extended expressions.
+This requires the substrait-java library.
+
+This Java program:
+
+- Loads a Parquet file containing the "nation" table from the TPC-H benchmark.
+- Projects two new columns:
+    - ``N_NAME || ' - ' || N_COMMENT``
+    - ``N_REGIONKEY + 10``
+- Applies a filter: ``N_NATIONKEY > 18``
+
+.. code-block:: Java
+
+    import com.google.protobuf.InvalidProtocolBufferException;
+    import io.substrait.extension.ExtensionCollector;
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    public class ClientSubstraitExtendedExpressions {
+        public static void main(String[] args) throws Exception {
+            // create extended expression for: project two new columns + one filter
+            ByteBuffer binaryExtendedExpressions = createExtendedExpresionMessageUsingPOJOClasses();

Review Comment:
   Long method names hurt clarity in the docs where the width is limited



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on PR #35570:
URL: https://github.com/apache/arrow/pull/35570#issuecomment-1572539572

   What I meant is if you could stage this as a rebase on top of those PRs, so that there's a separation. Otherwise I can review as is but I will assume no C++ or Python changes are from this PR specifically


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on PR #35570:
URL: https://github.com/apache/arrow/pull/35570#issuecomment-1572926604

   > What I meant is if you could stage this as a rebase on top of those PRs, so that there's a separation. Otherwise I can review as is but I will assume no C++ or Python changes are from this PR specifically
   
   Please consider that any change as part of cpp/*, python/* won't be part of this PR. Only consider: java/* and docs/*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1323164894


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +74,77 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getSubstraitProjection() {
+    return substraitProjection;
+  }
+
+  public Optional<ByteBuffer> getSubstraitFilter() {
+    return substraitFilter;
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private Optional<String[]> columns;
+    private Optional<ByteBuffer> substraitProjection;
+    private Optional<ByteBuffer> substraitFilter;
+
+    /**
+     * Constructor.
+     * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+     */
+    public Builder(long batchSize) {
+      this.batchSize = batchSize;
+    }
+
+    /**
+     * Set the Projected columns. Empty for scanning all columns.
+     *
+     * @param columns Projected columns. Empty for scanning all columns.
+     * @return the ScanOptions configured.
+     */
+    public Builder columns(Optional<String[]> columns) {

Review Comment:
   One more thing: We don't need `Optional<>` parameters for the builder APIs. We should expect the user to pass us a valid object. Same with `substraitProjection` and `substraitFilter`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1218429699


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +206,132 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testDeserializeExtendedExpressions() {
+    // Extended Expression 01 (`add` `2` to column `id`): id + 2
+    // Extended Expression 02 (`concatenate` column `name` || column `name`): name || name
+    // Extended Expression 03 (`filter` 'id' < 20): id < 20
+    // Extended expression result: [add_two_to_column_a, add(FieldPath(0), 2),
+    // concat_column_a_and_b, binary_join_element_wise(FieldPath(1), FieldPath(1), ""),
+    // filter_one, (FieldPath(0) < 20)]
+    String binaryExtendedExpressions = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwKHggCEhovZnVuY3Rpb25zX2NvbXBhcmlz" +
+        "b24ueWFtbBIRGg8IARoLYWRkOmkzMl9pMzISFBoSCAIQARoMY29uY2F0OnZjaGFyEhIaEAgCEAIaCmx0OmFueV9hbnkaMQoaGhgaBCoCEAE" +
+        "iCBoGEgQKAhIAIgYaBAoCKAIaE2FkZF90d29fdG9fY29sdW1uX2EaOwoiGiAIARoEYgIQASIKGggSBgoEEgIIASIKGggSBgoEEgIIARoVY2" +
+        "9uY2F0X2NvbHVtbl9hX2FuZF9iGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzIwIhoKA" +
+        "klECgROQU1FEg4KBCoCEAEKBGICEAEYAg==";
+    // get binary plan
+    byte[] expression = Base64.getDecoder().decode(binaryExtendedExpressions);
+    ByteBuffer substraitExpression = ByteBuffer.allocateDirect(expression.length);
+    substraitExpression.put(expression);
+    // deserialize extended expression
+    List<String> extededExpressionList =
+        new AceroSubstraitConsumer(rootAllocator()).runDeserializeExpressions(substraitExpression);
+    assertEquals(3, extededExpressionList.size() / 2);
+    assertEquals("add_two_to_column_a", extededExpressionList.get(0));
+    assertEquals("add(FieldPath(0), 2)", extededExpressionList.get(1));
+    assertEquals("concat_column_a_and_b", extededExpressionList.get(2));
+    assertEquals("binary_join_element_wise(FieldPath(1), FieldPath(1), \"\")", extededExpressionList.get(3));
+    assertEquals("filter_id_lower_than_20", extededExpressionList.get(4));
+    assertEquals("(FieldPath(0) < 20)", extededExpressionList.get(5));
+  }

Review Comment:
   This help me to test my binary expression and see what projection or filter was discovered, could be really useful in this way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1218750347


##########
java/dataset/src/main/java/org/apache/arrow/dataset/jni/NativeDataset.java:
##########
@@ -36,12 +36,14 @@ public NativeDataset(NativeContext context, long datasetId) {
   }
 
   @Override
+  @SuppressWarnings("ArrayToString")

Review Comment:
   Deleted



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -696,3 +722,44 @@ JNIEXPORT void JNICALL
   JniAssertOkOrThrow(arrow::ExportRecordBatchReader(reader_out, arrow_stream_out));
   JNI_METHOD_END()
 }
+
+/*
+ * Class:     org_apache_arrow_dataset_substrait_JniWrapper
+ * Method:    executeDeserializeExpressions
+ * Signature: (Ljava/nio/ByteBuffer;)[Ljava/lang/String;
+ */
+JNIEXPORT jobjectArray JNICALL
+    Java_org_apache_arrow_dataset_substrait_JniWrapper_executeDeserializeExpressions (
+    JNIEnv* env, jobject, jobject expression) {
+  JNI_METHOD_START
+  auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(expression));
+  int length = env->GetDirectBufferCapacity(expression);
+  std::shared_ptr<arrow::Buffer> buffer = JniGetOrThrow(arrow::AllocateBuffer(length));
+  std::memcpy(buffer->mutable_data(), buff, length);
+  // execute expression
+      arrow::engine::BoundExpressions round_tripped =

Review Comment:
   Deleted



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1218750254


##########
java/dataset/src/main/java/org/apache/arrow/dataset/substrait/AceroSubstraitConsumer.java:
##########
@@ -90,6 +91,15 @@ public ArrowReader runQuery(ByteBuffer plan, Map<String, ArrowReader> namedTable
     return execute(plan, namedTables);
   }
 
+  public List<String> runDeserializeExpressions(ByteBuffer plan) {

Review Comment:
   Deleted



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -56,10 +58,26 @@ public ScanOptions(long batchSize, Optional<String[]> columns) {
     Preconditions.checkNotNull(columns);
     this.batchSize = batchSize;
     this.columns = columns;
+    this.projectExpression = Optional.empty();
+  }
+
+  /**
+   * Constructor.
+   * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+   * @param columns (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+   *                Only columns present in the Array will be scanned.
+   * @param projectExpression (Optional) Expressions to evaluate to produce columns
+   */
+  public ScanOptions(long batchSize, Optional<String[]> columns, Optional<ByteBuffer> projectExpression) {

Review Comment:
   Added builder class



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1306090414


##########
java/dataset/src/main/java/org/apache/arrow/dataset/source/Dataset.java:
##########
@@ -32,4 +32,14 @@ public interface Dataset extends AutoCloseable {
    * @return the Scanner instance
    */
   Scanner newScan(ScanOptions options);
+
+  /**
+   * Create a new Scanner, using the provided options,
+   * that contains the binary representation of the Substrait
+   * Extended Expression.
+   *
+   * @param options options used during creating Scanner
+   * @return the Scanner instance
+   */
+  Scanner newSubstraitScan(ScanOptions options);

Review Comment:
   The point of having a ScanOptions object is that we shouldn't need a second function like this, right? newScan should just do the right thing based on the options.



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +83,8 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public ByteBuffer getSubstraitExtendedExpression() {

Review Comment:
   Why don't we separate out projection and filter? C++ does this.



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -484,6 +493,56 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   JNI_METHOD_END(-1L)
 }
 
+/*
+ * Class:     org_apache_arrow_dataset_jni_JniWrapper
+ * Method:    createSubstraitScanner
+ * Signature: (JLjava/nio/ByteBuffer;JJ)J
+ */
+JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createSubstraitScanner(
+    JNIEnv* env, jobject, jlong dataset_id, jobject substrait_expr_produce_or_filter, jlong batch_size,
+    jlong memory_pool_id) {
+  JNI_METHOD_START
+  arrow::MemoryPool* pool = reinterpret_cast<arrow::MemoryPool*>(memory_pool_id);
+  if (pool == nullptr) {
+    JniThrow("Memory pool does not exist or has been closed");
+  }
+  std::shared_ptr<arrow::dataset::Dataset> dataset =
+      RetrieveNativeInstance<arrow::dataset::Dataset>(dataset_id);
+  std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
+      JniGetOrThrow(dataset->NewScan());
+  JniAssertOkOrThrow(scanner_builder->Pool(pool));
+  if (substrait_expr_produce_or_filter != nullptr) {
+    std::shared_ptr<arrow::Buffer> buffer = LoadArrowBufferFromByteBuffer(env,
+                                                            substrait_expr_produce_or_filter);
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    std::optional<arrow::compute::Expression> filter_expr;
+    const arrow::engine::BoundExpressions bounded_expression =
+          JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    for(arrow::engine::NamedExpression named_expression :

Review Comment:
   ```suggestion
       arrow::engine::BoundExpressions bounded_expression =
             JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
       for(arrow::engine::NamedExpression& named_expression :
   ```



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -484,6 +493,56 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   JNI_METHOD_END(-1L)
 }
 
+/*
+ * Class:     org_apache_arrow_dataset_jni_JniWrapper
+ * Method:    createSubstraitScanner
+ * Signature: (JLjava/nio/ByteBuffer;JJ)J
+ */
+JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createSubstraitScanner(
+    JNIEnv* env, jobject, jlong dataset_id, jobject substrait_expr_produce_or_filter, jlong batch_size,
+    jlong memory_pool_id) {
+  JNI_METHOD_START
+  arrow::MemoryPool* pool = reinterpret_cast<arrow::MemoryPool*>(memory_pool_id);
+  if (pool == nullptr) {
+    JniThrow("Memory pool does not exist or has been closed");
+  }
+  std::shared_ptr<arrow::dataset::Dataset> dataset =
+      RetrieveNativeInstance<arrow::dataset::Dataset>(dataset_id);
+  std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
+      JniGetOrThrow(dataset->NewScan());
+  JniAssertOkOrThrow(scanner_builder->Pool(pool));
+  if (substrait_expr_produce_or_filter != nullptr) {
+    std::shared_ptr<arrow::Buffer> buffer = LoadArrowBufferFromByteBuffer(env,
+                                                            substrait_expr_produce_or_filter);
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    std::optional<arrow::compute::Expression> filter_expr;
+    const arrow::engine::BoundExpressions bounded_expression =
+          JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    for(arrow::engine::NamedExpression named_expression :
+                                        bounded_expression.named_expressions) {
+      if (named_expression.expression.type()->id() == arrow::Type::BOOL) {
+        if (filter_expr.has_value()) {
+          JniThrow("Only one filter expression may be provided");
+        }
+        filter_expr = named_expression.expression;
+      } else {
+        project_exprs.push_back(named_expression.expression);
+        project_names.push_back(named_expression.name);

Review Comment:
   ```suggestion
           project_exprs.push_back(std::move(named_expression.expression));
           project_names.push_back(std::move(named_expression.name));
   ```



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -484,6 +493,56 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   JNI_METHOD_END(-1L)
 }
 
+/*
+ * Class:     org_apache_arrow_dataset_jni_JniWrapper
+ * Method:    createSubstraitScanner
+ * Signature: (JLjava/nio/ByteBuffer;JJ)J
+ */
+JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createSubstraitScanner(
+    JNIEnv* env, jobject, jlong dataset_id, jobject substrait_expr_produce_or_filter, jlong batch_size,
+    jlong memory_pool_id) {
+  JNI_METHOD_START
+  arrow::MemoryPool* pool = reinterpret_cast<arrow::MemoryPool*>(memory_pool_id);
+  if (pool == nullptr) {
+    JniThrow("Memory pool does not exist or has been closed");
+  }
+  std::shared_ptr<arrow::dataset::Dataset> dataset =
+      RetrieveNativeInstance<arrow::dataset::Dataset>(dataset_id);
+  std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
+      JniGetOrThrow(dataset->NewScan());
+  JniAssertOkOrThrow(scanner_builder->Pool(pool));
+  if (substrait_expr_produce_or_filter != nullptr) {

Review Comment:
   this wasn't renamed?



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +83,8 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public ByteBuffer getSubstraitExtendedExpression() {

Review Comment:
   Either both getters should use Optional or neither should.



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -27,6 +28,7 @@
 public class ScanOptions {
   private final Optional<String[]> columns;
   private final long batchSize;
+  private ByteBuffer substraitExtendedExpression;

Review Comment:
   This must be final.



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -484,6 +493,56 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   JNI_METHOD_END(-1L)
 }
 
+/*
+ * Class:     org_apache_arrow_dataset_jni_JniWrapper
+ * Method:    createSubstraitScanner
+ * Signature: (JLjava/nio/ByteBuffer;JJ)J
+ */
+JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createSubstraitScanner(
+    JNIEnv* env, jobject, jlong dataset_id, jobject substrait_expr_produce_or_filter, jlong batch_size,
+    jlong memory_pool_id) {
+  JNI_METHOD_START
+  arrow::MemoryPool* pool = reinterpret_cast<arrow::MemoryPool*>(memory_pool_id);
+  if (pool == nullptr) {
+    JniThrow("Memory pool does not exist or has been closed");
+  }
+  std::shared_ptr<arrow::dataset::Dataset> dataset =
+      RetrieveNativeInstance<arrow::dataset::Dataset>(dataset_id);
+  std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
+      JniGetOrThrow(dataset->NewScan());
+  JniAssertOkOrThrow(scanner_builder->Pool(pool));
+  if (substrait_expr_produce_or_filter != nullptr) {
+    std::shared_ptr<arrow::Buffer> buffer = LoadArrowBufferFromByteBuffer(env,
+                                                            substrait_expr_produce_or_filter);
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    std::optional<arrow::compute::Expression> filter_expr;
+    const arrow::engine::BoundExpressions bounded_expression =
+          JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    for(arrow::engine::NamedExpression named_expression :
+                                        bounded_expression.named_expressions) {
+      if (named_expression.expression.type()->id() == arrow::Type::BOOL) {
+        if (filter_expr.has_value()) {
+          JniThrow("Only one filter expression may be provided");
+        }
+        filter_expr = named_expression.expression;
+      } else {
+        project_exprs.push_back(named_expression.expression);
+        project_names.push_back(named_expression.name);
+      }
+    }
+    JniAssertOkOrThrow(scanner_builder->Project(std::move(project_exprs), std::move(project_names)));
+    JniAssertOkOrThrow(scanner_builder->Filter(*std::move(filter_expr)));

Review Comment:
   ```suggestion
       JniAssertOkOrThrow(scanner_builder->Filter(*filter_expr));
   ```



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -58,6 +60,18 @@ public ScanOptions(long batchSize, Optional<String[]> columns) {
     this.columns = columns;
   }
 
+  /**
+   * Constructor.
+   * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+   * @param substraitExtendedExpression Extended expression to evaluate for project new columns or apply filter.
+   */
+  public ScanOptions(long batchSize, ByteBuffer substraitExtendedExpression) {

Review Comment:
   We can and still should have a builder, I was just saying that it doesn't make sense to pass Optional as a constructor parameter here...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1327264460


##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -474,6 +484,39 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
     std::vector<std::string> column_vector = ToStringVector(env, columns);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (substrait_projection != nullptr) {
+    std::shared_ptr<arrow::Buffer> buffer = LoadArrowBufferFromByteBuffer(env,
+                                                            substrait_projection);
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    arrow::engine::BoundExpressions bounded_expression =
+          JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    for(arrow::engine::NamedExpression& named_expression :
+                                        bounded_expression.named_expressions) {
+      if (!(named_expression.expression.type()->id() == arrow::Type::BOOL)) {
+        project_exprs.push_back(std::move(named_expression.expression));
+        project_names.push_back(std::move(named_expression.name));
+      }
+    }
+    JniAssertOkOrThrow(scanner_builder->Project(std::move(project_exprs), std::move(project_names)));
+  }
+  if (substrait_filter != nullptr) {
+    std::shared_ptr<arrow::Buffer> buffer = LoadArrowBufferFromByteBuffer(env,
+                                                                substrait_filter);
+    std::optional<arrow::compute::Expression> filter_expr;
+    arrow::engine::BoundExpressions bounded_expression =
+          JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    for(arrow::engine::NamedExpression& named_expression :
+                                        bounded_expression.named_expressions) {
+      if (named_expression.expression.type()->id() == arrow::Type::BOOL) {
+        if (filter_expr.has_value()) {
+          JniThrow("Only one filter expression may be provided");
+        }
+        filter_expr = named_expression.expression;
+      }
+    }
+    JniAssertOkOrThrow(scanner_builder->Filter(*filter_expr));

Review Comment:
   changed



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -474,6 +484,39 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
     std::vector<std::string> column_vector = ToStringVector(env, columns);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (substrait_projection != nullptr) {
+    std::shared_ptr<arrow::Buffer> buffer = LoadArrowBufferFromByteBuffer(env,
+                                                            substrait_projection);
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    arrow::engine::BoundExpressions bounded_expression =
+          JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    for(arrow::engine::NamedExpression& named_expression :
+                                        bounded_expression.named_expressions) {
+      if (!(named_expression.expression.type()->id() == arrow::Type::BOOL)) {

Review Comment:
   updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1312181710


##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -458,8 +467,8 @@ JNIEXPORT void JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_closeDataset
  * Signature: (J[Ljava/lang/String;JJ)J
  */
 JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScanner(
-    JNIEnv* env, jobject, jlong dataset_id, jobjectArray columns, jlong batch_size,
-    jlong memory_pool_id) {
+    JNIEnv* env, jobject, jlong dataset_id, jobjectArray columns,
+    jobject substrait_extended_expression, jlong batch_size, jlong memory_pool_id) {

Review Comment:
   It is possible to do that. It may be necessary to invoke Cpp twice to get bounded expressions, once for each filter and once for each projection. What do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1312200301


##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -458,8 +467,8 @@ JNIEXPORT void JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_closeDataset
  * Signature: (J[Ljava/lang/String;JJ)J
  */
 JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScanner(
-    JNIEnv* env, jobject, jlong dataset_id, jobjectArray columns, jlong batch_size,
-    jlong memory_pool_id) {
+    JNIEnv* env, jobject, jlong dataset_id, jobjectArray columns,
+    jobject substrait_extended_expression, jlong batch_size, jlong memory_pool_id) {

Review Comment:
   I think that should be fine. At this point we have already crossed the JNI boundary so I don't anticipate much performance impact in making the call twice. Ideally, we should keep the implementation flexible enough such that somebody can easily add support for Acero compute expressions as well. e.g. from the user perspective a `filter` can be either a compute expression or a substrait binary blob that gets parsed into a compute expression (wrapped in named/bounded expression objects). We don't need to add that additional Acero functionality in this PR, though. Just make it easy to extend later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1327782914


##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,350 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Dataset also supports projections and filters with Substrait's `Extended Expression`_.
+This requires the substrait-java library.
+
+This Java program:
+
+- Loads a Parquet file containing the "nation" table from the TPC-H benchmark.
+- Projects two new columns:
+    - ``N_NAME || ' - ' || N_COMMENT``
+    - ``N_REGIONKEY + 10``
+- Applies a filter: ``N_NATIONKEY > 18``
+
+.. code-block:: Java
+
+    import io.substrait.extension.ExtensionCollector;
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    public class ClientSubstraitExtendedExpressionsCookbook {
+
+      public static void main(String[] args) throws Exception {
+        // project and filter dataset using extended expression definition - 03 Expressions:
+        // Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3
+        // Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10
+        // Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18
+        projectAndFilterDataset();
+      }
+
+      public static void projectAndFilterDataset() {
+        //String uri = "file:///Users/dsusanibar/data/tpch_parquet/nation.parquet";
+        String uri = "file:////Users/dsusanibar/voltron/fork/consumer-testing/tests/data/tpch_parquet/nation.parquet";

Review Comment:
   Can we not put company names in strings? And remove the redundant comment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1304705501


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -49,24 +51,72 @@ public ScanOptions(String[] columns, long batchSize) {
   /**
    * Constructor.
    * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
-   * @param columns (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+   * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
    *                Only columns present in the Array will be scanned.
    */
-  public ScanOptions(long batchSize, Optional<String[]> columns) {
-    Preconditions.checkNotNull(columns);
+  public ScanOptions(long batchSize, Optional<String[]> columnsSubset) {
+    Preconditions.checkNotNull(columnsSubset);
     this.batchSize = batchSize;
-    this.columns = columns;
+    this.columnsSubset = columnsSubset;
+    this.columnsProduceOrFilter = Optional.empty();
   }
 
   public ScanOptions(long batchSize) {
     this(batchSize, Optional.empty());
   }
 
-  public Optional<String[]> getColumns() {
-    return columns;
+  public Optional<String[]> getColumnsSubset() {
+    return columnsSubset;
   }
 
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getColumnsProduceOrFilter() {
+    return columnsProduceOrFilter;
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private final Optional<String[]> columnsSubset;
+    private Optional<ByteBuffer> columnsProduceOrFilter = Optional.empty();
+
+    /**
+     * Constructor.
+     * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+     * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+     *                Only columns present in the Array will be scanned.
+     */
+    public Builder(long batchSize, Optional<String[]> columnsSubset) {

Review Comment:
   deleted



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -49,24 +51,72 @@ public ScanOptions(String[] columns, long batchSize) {
   /**
    * Constructor.
    * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
-   * @param columns (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+   * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
    *                Only columns present in the Array will be scanned.
    */
-  public ScanOptions(long batchSize, Optional<String[]> columns) {
-    Preconditions.checkNotNull(columns);
+  public ScanOptions(long batchSize, Optional<String[]> columnsSubset) {
+    Preconditions.checkNotNull(columnsSubset);
     this.batchSize = batchSize;
-    this.columns = columns;
+    this.columnsSubset = columnsSubset;
+    this.columnsProduceOrFilter = Optional.empty();
   }
 
   public ScanOptions(long batchSize) {
     this(batchSize, Optional.empty());
   }
 
-  public Optional<String[]> getColumns() {
-    return columns;
+  public Optional<String[]> getColumnsSubset() {
+    return columnsSubset;
   }
 
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getColumnsProduceOrFilter() {
+    return columnsProduceOrFilter;
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private final Optional<String[]> columnsSubset;
+    private Optional<ByteBuffer> columnsProduceOrFilter = Optional.empty();
+
+    /**
+     * Constructor.
+     * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+     * @param columnsSubset (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+     *                Only columns present in the Array will be scanned.
+     */
+    public Builder(long batchSize, Optional<String[]> columnsSubset) {
+      Preconditions.checkNotNull(columnsSubset);
+      this.batchSize = batchSize;
+      this.columnsSubset = columnsSubset;
+    }
+
+    /**
+     * Define binary extended expression message for projects new columns or applies filter.
+     *
+     * @param columnsProduceOrFilter (Optional) Expressions to evaluate to projects new columns or applies filter.
+     * @return the ScanOptions configured.
+     */
+    public Builder columnsProduceOrFilter(Optional<ByteBuffer> columnsProduceOrFilter) {

Review Comment:
   changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1304705000


##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -470,12 +471,37 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
       JniGetOrThrow(dataset->NewScan());
   JniAssertOkOrThrow(scanner_builder->Pool(pool));
-  if (columns != nullptr) {
-    std::vector<std::string> column_vector = ToStringVector(env, columns);
+  if (columns_subset != nullptr) {
+    std::vector<std::string> column_vector = ToStringVector(env, columns_subset);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (columns_to_produce_or_filter != nullptr) {
+    auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(columns_to_produce_or_filter));
+    int length = env->GetDirectBufferCapacity(columns_to_produce_or_filter);
+    std::shared_ptr<arrow::Buffer> buffer = JniGetOrThrow(arrow::AllocateBuffer(length));
+    std::memcpy(buffer->mutable_data(), buff, length);
+    arrow::engine::BoundExpressions bounded_expression =
+      JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    arrow::compute::Expression filter_expr;
+    int filter_count = 0;
+    for(arrow::engine::NamedExpression named_expression : bounded_expression.named_expressions) {
+      if (named_expression.expression.type()->id() == arrow::Type::BOOL) {
+        if (filter_count > 0) {
+          JniThrow("The process only support one filter expression declared");
+        }
+        filter_expr = named_expression.expression;
+        filter_count++;

Review Comment:
   added



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -470,12 +471,37 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
       JniGetOrThrow(dataset->NewScan());
   JniAssertOkOrThrow(scanner_builder->Pool(pool));
-  if (columns != nullptr) {
-    std::vector<std::string> column_vector = ToStringVector(env, columns);
+  if (columns_subset != nullptr) {
+    std::vector<std::string> column_vector = ToStringVector(env, columns_subset);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (columns_to_produce_or_filter != nullptr) {
+    auto *buff = reinterpret_cast<jbyte*>(env->GetDirectBufferAddress(columns_to_produce_or_filter));
+    int length = env->GetDirectBufferCapacity(columns_to_produce_or_filter);
+    std::shared_ptr<arrow::Buffer> buffer = JniGetOrThrow(arrow::AllocateBuffer(length));
+    std::memcpy(buffer->mutable_data(), buff, length);
+    arrow::engine::BoundExpressions bounded_expression =
+      JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    arrow::compute::Expression filter_expr;
+    int filter_count = 0;
+    for(arrow::engine::NamedExpression named_expression : bounded_expression.named_expressions) {
+      if (named_expression.expression.type()->id() == arrow::Type::BOOL) {
+        if (filter_count > 0) {
+          JniThrow("The process only support one filter expression declared");
+        }
+        filter_expr = named_expression.expression;
+        filter_count++;
+      } else {
+        project_exprs.push_back(named_expression.expression);
+        project_names.push_back(named_expression.name);
+      }
+    }
+    JniAssertOkOrThrow(scanner_builder->Project(project_exprs, project_names));

Review Comment:
   added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1313230005


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +74,77 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getSubstraitExpressionProjection() {
+    return substraitExpressionProjection;
+  }
+
+  public Optional<ByteBuffer> getSubstraitExpressionFilter() {
+    return substraitExpressionFilter;
+  }

Review Comment:
   Can we call these `substrait_projection` and `substrait_filter`? I think we can either leave out the word "expression" or else change it to "susbstrait_extended_expression_X" if we want to be verbose. I'm curious if other folks have thoughts on readability. Substrait will probably be a new concept to many Arrow Java users so I think it would be good to have consistent and clear naming here.
   
   If we change the naming, it would be best to change it everywhere e.g. in JNI/C++, too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1318007446


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,170 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+        assertTrue(reader.getVectorSchemaRoot().getVector("id").toString().equals("[19, 1, 11]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("name").toString()
+            .equals("[value_19, value_1, value_11]"));

Review Comment:
   For that purpose, a ValueVector needs to be created on the fly, mutating with fixed data added. It would be possible for me to do that, but it would add a step that will not be relevant to the purpose of this uni test. Let me confirm that this should be added
   
   ```
         IntVector valueVector = new IntVector("id", rootAllocator());
         valueVector.allocateNew(3);
         valueVector.set(0, 19);
         valueVector.set(1, 1);
         valueVector.set(2, 11);
         valueVector.setValueCount(3);
         ...
         assertEquals(reader.getVectorSchemaRoot().getVector("id").toString(), valueVector.toString());
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1318986445


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,170 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+        assertTrue(reader.getVectorSchemaRoot().getVector("id").toString().equals("[19, 1, 11]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("name").toString()
+            .equals("[value_19, value_1, value_11]"));

Review Comment:
   Ah I see. I was hoping there would be a `.to_pojo()` type of option where you could compare to a Java ListArray or something like that. Maybe better not to add too much unnecessary code to the test cases..



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35570:
URL: https://github.com/apache/arrow/pull/35570#issuecomment-1545707372

   * Closes: #34252


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35570:
URL: https://github.com/apache/arrow/pull/35570#issuecomment-1561959444

   :warning: GitHub issue #34252 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1218428424


##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,335 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Using `Extended Expression`_ we could leverage our current Dataset operations to
+also support Projections and Filters by. To gain access to Projections and Filters
+is needed to define that operations using current Extended Expression Java POJO
+classes defined into `Substrait Java`_ project.
+
+Here is an example of a Java program that queries a Parquet file to project new
+columns and also filter then based on Extended Expression definitions. This example
+show us:
+
+- Load TPCH parquet file Nation.parquet.
+- Produce new Projections and apply Filter into dataset using extended expression definition.
+    - Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3.
+    - Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10.
+    - Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18.
+
+.. code-block:: Java
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import com.google.protobuf.InvalidProtocolBufferException;
+    import com.google.protobuf.util.JsonFormat;
+
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+
+    public class ClientSubstraitExtendedExpressions {
+      public static void main(String[] args) throws Exception {
+        // create extended expression for: project two new columns + one filter
+        String binaryExtendedExpressions = createExtendedExpresionMessageUsingPOJOClasses();
+        // project and filter dataset using extended expression definition - 03 Expressions:
+        // Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3
+        // Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10
+        // Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18
+        projectAndFilterDataset(binaryExtendedExpressions);
+      }
+
+      public static void projectAndFilterDataset(String binaryExtendedExpressions) {
+        String uri = "file:///data/tpch_parquet/nation.parquet";
+        byte[] extendedExpressions = Base64.getDecoder().decode(
+            binaryExtendedExpressions);
+        ByteBuffer substraitExtendedExpressions = ByteBuffer.allocateDirect(
+            extendedExpressions.length);
+        substraitExtendedExpressions.put(extendedExpressions);
+        ScanOptions options = new ScanOptions(/*batchSize*/ 32768,
+            Optional.empty(),
+            Optional.of(substraitExtendedExpressions));
+        try (
+            BufferAllocator allocator = new RootAllocator();
+            DatasetFactory datasetFactory = new FileSystemDatasetFactory(
+                allocator, NativeMemoryPool.getDefault(),
+                FileFormat.PARQUET, uri);
+            Dataset dataset = datasetFactory.finish();
+            Scanner scanner = dataset.newScan(options);
+            ArrowReader reader = scanner.scanBatches()
+        ) {
+          while (reader.loadNextBatch()) {
+            System.out.println(
+                reader.getVectorSchemaRoot().contentToTSVString());
+          }
+        } catch (Exception e) {
+          e.printStackTrace();
+        }
+      }
+
+      private static String createExtendedExpresionMessageUsingPOJOClasses() throws InvalidProtocolBufferException {
+        // Expression: N_REGIONKEY + 10 = col 3 + 10
+        Expression.Builder selectionBuilderProjectOne = Expression.newBuilder().
+            setSelection(
+                Expression.FieldReference.newBuilder().
+                    setDirectReference(
+                        Expression.ReferenceSegment.newBuilder().
+                            setStructField(
+                                Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                    2)
+                            )
+                    )
+            );
+        Expression.Builder literalBuilderProjectOne = Expression.newBuilder()
+            .setLiteral(
+                Expression.Literal.newBuilder().setI32(10)
+            );
+        io.substrait.proto.Type outputProjectOne = TypeCreator.NULLABLE.I32.accept(
+            new TypeProtoConverter());
+        Expression.Builder expressionBuilderProjectOne = Expression.
+            newBuilder().
+            setScalarFunction(
+                Expression.
+                    ScalarFunction.
+                    newBuilder().
+                    setFunctionReference(0).
+                    setOutputType(outputProjectOne).
+                    addArguments(
+                        0,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderProjectOne)
+                    ).
+                    addArguments(
+                        1,
+                        FunctionArgument.newBuilder().setValue(
+                            literalBuilderProjectOne)
+                    )
+            );
+        ExpressionReference.Builder expressionReferenceBuilderProjectOne = ExpressionReference.newBuilder().
+            setExpression(expressionBuilderProjectOne)
+            .addOutputNames("ADD_TEN_TO_COLUMN_N_REGIONKEY");
+
+        // Expression: name || name = N_NAME || "-" || N_COMMENT = col 1 || col 3
+        Expression.Builder selectionBuilderProjectTwo = Expression.newBuilder().
+            setSelection(
+                Expression.FieldReference.newBuilder().
+                    setDirectReference(
+                        Expression.ReferenceSegment.newBuilder().
+                            setStructField(
+                                Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                    1)
+                            )
+                    )
+            );
+        Expression.Builder selectionBuilderProjectTwoConcatLiteral = Expression.newBuilder()
+            .setLiteral(
+                Expression.Literal.newBuilder().setString(" - ")
+            );
+        Expression.Builder selectionBuilderProjectOneToConcat = Expression.newBuilder().
+            setSelection(
+                Expression.FieldReference.newBuilder().
+                    setDirectReference(
+                        Expression.ReferenceSegment.newBuilder().
+                            setStructField(
+                                Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                    3)
+                            )
+                    )
+            );
+        io.substrait.proto.Type outputProjectTwo = TypeCreator.NULLABLE.STRING.accept(
+            new TypeProtoConverter());
+        Expression.Builder expressionBuilderProjectTwo = Expression.
+            newBuilder().
+            setScalarFunction(
+                Expression.
+                    ScalarFunction.
+                    newBuilder().
+                    setFunctionReference(1).
+                    setOutputType(outputProjectTwo).
+                    addArguments(
+                        0,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderProjectTwo)
+                    ).
+                    addArguments(
+                        1,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderProjectTwoConcatLiteral)
+                    ).
+                    addArguments(
+                        2,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderProjectOneToConcat)
+                    )
+            );
+        ExpressionReference.Builder expressionReferenceBuilderProjectTwo = ExpressionReference.newBuilder().
+            setExpression(expressionBuilderProjectTwo)
+            .addOutputNames("CONCAT_COLUMNS_N_NAME_AND_N_COMMENT");
+
+        // Expression: Filter: N_NATIONKEY > 18 = col 1 > 18
+        Expression.Builder selectionBuilderFilterOne = Expression.newBuilder().
+            setSelection(
+                Expression.FieldReference.newBuilder().
+                    setDirectReference(
+                        Expression.ReferenceSegment.newBuilder().
+                            setStructField(
+                                Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                    0)
+                            )
+                    )
+            );
+        Expression.Builder literalBuilderFilterOne = Expression.newBuilder()
+            .setLiteral(
+                Expression.Literal.newBuilder().setI32(18)
+            );
+        io.substrait.proto.Type outputFilterOne = TypeCreator.NULLABLE.BOOLEAN.accept(
+            new TypeProtoConverter());
+        Expression.Builder expressionBuilderFilterOne = Expression.
+            newBuilder().
+            setScalarFunction(
+                Expression.
+                    ScalarFunction.
+                    newBuilder().
+                    setFunctionReference(2).
+                    setOutputType(outputFilterOne).
+                    addArguments(
+                        0,
+                        FunctionArgument.newBuilder().setValue(
+                            selectionBuilderFilterOne)
+                    ).
+                    addArguments(
+                        1,
+                        FunctionArgument.newBuilder().setValue(
+                            literalBuilderFilterOne)
+                    )
+            );
+        ExpressionReference.Builder expressionReferenceBuilderFilterOne = ExpressionReference.newBuilder().
+            setExpression(expressionBuilderFilterOne)
+            .addOutputNames("COLUMN_N_NATIONKEY_GREATER_THAN_18");
+
+        List<String> columnNames = Arrays.asList("N_NATIONKEY", "N_NAME",
+            "N_REGIONKEY", "N_COMMENT");
+        List<Type> dataTypes = Arrays.asList(
+            TypeCreator.NULLABLE.I32,
+            TypeCreator.NULLABLE.STRING,
+            TypeCreator.NULLABLE.I32,
+            TypeCreator.NULLABLE.STRING
+        );
+        //
+        NamedStruct of = NamedStruct.of(
+            columnNames,
+            Type.Struct.builder().fields(dataTypes).nullable(false).build()
+        );
+
+        // Extensions URI
+        HashMap<String, SimpleExtensionURI> extensionUris = new HashMap<>();
+        extensionUris.put(
+            "key-001",
+            SimpleExtensionURI.newBuilder()
+                .setExtensionUriAnchor(1)
+                .setUri("/functions_arithmetic.yaml")
+                .build()
+        );
+        extensionUris.put(
+            "key-002",
+            SimpleExtensionURI.newBuilder()
+                .setExtensionUriAnchor(2)
+                .setUri("/functions_comparison.yaml")
+                .build()
+        );
+
+        // Extensions
+        ArrayList<SimpleExtensionDeclaration> extensions = new ArrayList<>();
+        SimpleExtensionDeclaration extensionFunctionAdd = SimpleExtensionDeclaration.newBuilder()
+            .setExtensionFunction(
+                SimpleExtensionDeclaration.ExtensionFunction.newBuilder()
+                    .setFunctionAnchor(0)
+                    .setName("add:i32_i32")
+                    .setExtensionUriReference(1))
+            .build();
+        SimpleExtensionDeclaration extensionFunctionGreaterThan = SimpleExtensionDeclaration.newBuilder()
+            .setExtensionFunction(
+                SimpleExtensionDeclaration.ExtensionFunction.newBuilder()
+                    .setFunctionAnchor(1)
+                    .setName("concat:vchar")
+                    .setExtensionUriReference(2))
+            .build();
+        SimpleExtensionDeclaration extensionFunctionLowerThan = SimpleExtensionDeclaration.newBuilder()
+            .setExtensionFunction(
+                SimpleExtensionDeclaration.ExtensionFunction.newBuilder()
+                    .setFunctionAnchor(2)
+                    .setName("gt:any_any")
+                    .setExtensionUriReference(2))
+            .build();
+        extensions.add(extensionFunctionAdd);
+        extensions.add(extensionFunctionGreaterThan);
+        extensions.add(extensionFunctionLowerThan);
+
+        // Extended Expression
+        ExtendedExpression.Builder extendedExpressionBuilder =
+            ExtendedExpression.newBuilder().
+                addReferredExpr(0,
+                    expressionReferenceBuilderProjectOne).
+                addReferredExpr(1,
+                    expressionReferenceBuilderProjectTwo).
+                addReferredExpr(2,
+                    expressionReferenceBuilderFilterOne).
+                setBaseSchema(of.toProto());
+        extendedExpressionBuilder.addAllExtensionUris(extensionUris.values());
+        extendedExpressionBuilder.addAllExtensions(extensions);
+
+        ExtendedExpression extendedExpression = extendedExpressionBuilder.build();

Review Comment:
   The plan is to be aligned with last changes creating a java cookbook to testing this code snippet after the merge of this PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on PR #35570:
URL: https://github.com/apache/arrow/pull/35570#issuecomment-1572214530

   @davisusanibar is it possible to rebase so it's clear which commits are for this PR and which are from the other PRs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35570:
URL: https://github.com/apache/arrow/pull/35570#issuecomment-1572167813

   :warning: GitHub issue #34252 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1323252393


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +74,77 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getSubstraitProjection() {
+    return substraitProjection;
+  }
+
+  public Optional<ByteBuffer> getSubstraitFilter() {
+    return substraitFilter;
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private Optional<String[]> columns;
+    private Optional<ByteBuffer> substraitProjection;
+    private Optional<ByteBuffer> substraitFilter;
+
+    /**
+     * Constructor.
+     * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+     */
+    public Builder(long batchSize) {
+      this.batchSize = batchSize;
+    }
+
+    /**
+     * Set the Projected columns. Empty for scanning all columns.
+     *
+     * @param columns Projected columns. Empty for scanning all columns.
+     * @return the ScanOptions configured.
+     */
+    public Builder columns(Optional<String[]> columns) {

Review Comment:
   Ah I see. I assumed a user would prefer to only use the `columns` API when they want to project a subset of columns because if left blank, the builder will build an empty `Optional<> columns` object automatically. I'm okay with leaving as-is. Thanks for the udpates!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1326390844


##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -474,6 +484,39 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
     std::vector<std::string> column_vector = ToStringVector(env, columns);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (substrait_projection != nullptr) {
+    std::shared_ptr<arrow::Buffer> buffer = LoadArrowBufferFromByteBuffer(env,
+                                                            substrait_projection);
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    arrow::engine::BoundExpressions bounded_expression =
+          JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    for(arrow::engine::NamedExpression& named_expression :
+                                        bounded_expression.named_expressions) {
+      if (!(named_expression.expression.type()->id() == arrow::Type::BOOL)) {
+        project_exprs.push_back(std::move(named_expression.expression));
+        project_names.push_back(std::move(named_expression.name));
+      }
+    }
+    JniAssertOkOrThrow(scanner_builder->Project(std::move(project_exprs), std::move(project_names)));
+  }
+  if (substrait_filter != nullptr) {
+    std::shared_ptr<arrow::Buffer> buffer = LoadArrowBufferFromByteBuffer(env,
+                                                                substrait_filter);
+    std::optional<arrow::compute::Expression> filter_expr;
+    arrow::engine::BoundExpressions bounded_expression =
+          JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    for(arrow::engine::NamedExpression& named_expression :
+                                        bounded_expression.named_expressions) {
+      if (named_expression.expression.type()->id() == arrow::Type::BOOL) {
+        if (filter_expr.has_value()) {
+          JniThrow("Only one filter expression may be provided");
+        }
+        filter_expr = named_expression.expression;
+      }
+    }
+    JniAssertOkOrThrow(scanner_builder->Filter(*filter_expr));

Review Comment:
   This will crash if you provide an empty list of expressions.



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -474,6 +484,39 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
     std::vector<std::string> column_vector = ToStringVector(env, columns);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (substrait_projection != nullptr) {
+    std::shared_ptr<arrow::Buffer> buffer = LoadArrowBufferFromByteBuffer(env,
+                                                            substrait_projection);
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    arrow::engine::BoundExpressions bounded_expression =
+          JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    for(arrow::engine::NamedExpression& named_expression :
+                                        bounded_expression.named_expressions) {
+      if (!(named_expression.expression.type()->id() == arrow::Type::BOOL)) {

Review Comment:
   ```suggestion
         if (named_expression.expression.type()->id() != arrow::Type::BOOL) {
   ```



##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,349 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Dataset also supports projections and filters with Substrait's `Extended Expression`_.
+This requires the substrait-java library.
+
+This Java program:
+
+- Loads a Parquet file containing the "nation" table from the TPC-H benchmark.
+- Projects two new columns:
+    - ``N_NAME || ' - ' || N_COMMENT``
+    - ``N_REGIONKEY + 10``
+- Applies a filter: ``N_NATIONKEY > 18``
+
+.. code-block:: Java
+
+    import io.substrait.extension.ExtensionCollector;
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    public class ClientSubstraitExtendedExpressionsCookbook {
+        public static void main(String[] args) throws Exception {
+            // project and filter dataset using extended expression definition - 03 Expressions:
+            // Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3
+            // Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10
+            // Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18
+            projectAndFilterDataset();
+        }
+
+        public static void projectAndFilterDataset() {
+            String uri = "file:///Users/data/tpch_parquet/nation.parquet";
+            ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+                    .columns(Optional.empty())
+                    .substraitFilter(getSubstraitExpressionFilter())
+                    .substraitProjection(getSubstraitExpressionProjection())
+                    .build();
+            try (
+                    BufferAllocator allocator = new RootAllocator();
+                    DatasetFactory datasetFactory = new FileSystemDatasetFactory(
+                            allocator, NativeMemoryPool.getDefault(),
+                            FileFormat.PARQUET, uri);
+                    Dataset dataset = datasetFactory.finish();
+                    Scanner scanner = dataset.newScan(options);
+                    ArrowReader reader = scanner.scanBatches()
+            ) {
+                while (reader.loadNextBatch()) {
+                    System.out.println(
+                            reader.getVectorSchemaRoot().contentToTSVString());
+                }
+            } catch (Exception e) {
+                e.printStackTrace();
+            }

Review Comment:
   just declare everything as `throws Exception`



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -474,6 +484,39 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
     std::vector<std::string> column_vector = ToStringVector(env, columns);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (substrait_projection != nullptr) {
+    std::shared_ptr<arrow::Buffer> buffer = LoadArrowBufferFromByteBuffer(env,
+                                                            substrait_projection);
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    arrow::engine::BoundExpressions bounded_expression =
+          JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    for(arrow::engine::NamedExpression& named_expression :
+                                        bounded_expression.named_expressions) {
+      if (!(named_expression.expression.type()->id() == arrow::Type::BOOL)) {
+        project_exprs.push_back(std::move(named_expression.expression));
+        project_names.push_back(std::move(named_expression.name));
+      }
+    }
+    JniAssertOkOrThrow(scanner_builder->Project(std::move(project_exprs), std::move(project_names)));
+  }
+  if (substrait_filter != nullptr) {
+    std::shared_ptr<arrow::Buffer> buffer = LoadArrowBufferFromByteBuffer(env,
+                                                                substrait_filter);
+    std::optional<arrow::compute::Expression> filter_expr;
+    arrow::engine::BoundExpressions bounded_expression =
+          JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    for(arrow::engine::NamedExpression& named_expression :
+                                        bounded_expression.named_expressions) {
+      if (named_expression.expression.type()->id() == arrow::Type::BOOL) {
+        if (filter_expr.has_value()) {
+          JniThrow("Only one filter expression may be provided");
+        }
+        filter_expr = named_expression.expression;
+      }

Review Comment:
   Throw if the expression is not of type BOOL.



##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,170 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+        assertTrue(reader.getVectorSchemaRoot().getVector("id").toString().equals("[19, 1, 11]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("name").toString()
+            .equals("[value_19, value_1, value_11]"));

Review Comment:
   I think we should consider what to do here; possibly an Iterator/Iterable/Stream that gives Java objects would be sufficient (and you could collect into a Java collection and then use standard assertions). Can you file a follow-up task?



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -474,6 +484,39 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
     std::vector<std::string> column_vector = ToStringVector(env, columns);
     JniAssertOkOrThrow(scanner_builder->Project(column_vector));
   }
+  if (substrait_projection != nullptr) {
+    std::shared_ptr<arrow::Buffer> buffer = LoadArrowBufferFromByteBuffer(env,
+                                                            substrait_projection);
+    std::vector<arrow::compute::Expression> project_exprs;
+    std::vector<std::string> project_names;
+    arrow::engine::BoundExpressions bounded_expression =
+          JniGetOrThrow(arrow::engine::DeserializeExpressions(*buffer));
+    for(arrow::engine::NamedExpression& named_expression :
+                                        bounded_expression.named_expressions) {
+      if (!(named_expression.expression.type()->id() == arrow::Type::BOOL)) {

Review Comment:
   Why do we have this in the first place? I think this was left over from a refactor as seen below. It should be perfectly fine to project a BOOL column.



##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,349 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Dataset also supports projections and filters with Substrait's `Extended Expression`_.
+This requires the substrait-java library.
+
+This Java program:
+
+- Loads a Parquet file containing the "nation" table from the TPC-H benchmark.
+- Projects two new columns:
+    - ``N_NAME || ' - ' || N_COMMENT``
+    - ``N_REGIONKEY + 10``
+- Applies a filter: ``N_NATIONKEY > 18``
+
+.. code-block:: Java
+
+    import io.substrait.extension.ExtensionCollector;
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    public class ClientSubstraitExtendedExpressionsCookbook {
+        public static void main(String[] args) throws Exception {
+            // project and filter dataset using extended expression definition - 03 Expressions:
+            // Expression 01 - CONCAT: N_NAME || ' - ' || N_COMMENT = col 1 || ' - ' || col 3
+            // Expression 02 - ADD: N_REGIONKEY + 10 = col 1 + 10
+            // Expression 03 - FILTER: N_NATIONKEY > 18 = col 3 > 18
+            projectAndFilterDataset();
+        }
+
+        public static void projectAndFilterDataset() {
+            String uri = "file:///Users/data/tpch_parquet/nation.parquet";
+            ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+                    .columns(Optional.empty())
+                    .substraitFilter(getSubstraitExpressionFilter())
+                    .substraitProjection(getSubstraitExpressionProjection())
+                    .build();
+            try (
+                    BufferAllocator allocator = new RootAllocator();
+                    DatasetFactory datasetFactory = new FileSystemDatasetFactory(
+                            allocator, NativeMemoryPool.getDefault(),
+                            FileFormat.PARQUET, uri);
+                    Dataset dataset = datasetFactory.finish();
+                    Scanner scanner = dataset.newScan(options);
+                    ArrowReader reader = scanner.scanBatches()
+            ) {
+                while (reader.loadNextBatch()) {
+                    System.out.println(
+                            reader.getVectorSchemaRoot().contentToTSVString());
+                }
+            } catch (Exception e) {
+                e.printStackTrace();
+            }
+        }
+
+        private static ByteBuffer getSubstraitExpressionProjection() {
+            // Expression: N_REGIONKEY + 10 = col 3 + 10
+            Expression.Builder selectionBuilderProjectOne = Expression.newBuilder().
+                    setSelection(
+                            Expression.FieldReference.newBuilder().
+                                    setDirectReference(
+                                            Expression.ReferenceSegment.newBuilder().
+                                                    setStructField(
+                                                            Expression.ReferenceSegment.StructField.newBuilder().setField(
+                                                                    2)
+                                                    )
+                                    )
+                    );

Review Comment:
   nit: how are you formatting examples? I think it would make sense to just use `google-java-format` which has a more compact style. In our docs, going too far to the right is unreadable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1304702165


##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,323 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Dataset also supports projections and filters with Substrait's extended expressions.

Review Comment:
   added



##########
docs/source/java/substrait.rst:
##########
@@ -102,6 +104,323 @@ Here is an example of a Java program that queries a Parquet file using Java Subs
     0	ALGERIA	0	 haggle. carefully final deposits detect slyly agai
     1	ARGENTINA	1	al foxes promise slyly according to the regular accounts. bold requests alon
 
+Executing Projections and Filters Using Extended Expressions
+============================================================
+
+Dataset also supports projections and filters with Substrait's extended expressions.
+This requires the substrait-java library.
+
+This Java program:
+
+- Loads a Parquet file containing the "nation" table from the TPC-H benchmark.
+- Projects two new columns:
+    - ``N_NAME || ' - ' || N_COMMENT``
+    - ``N_REGIONKEY + 10``
+- Applies a filter: ``N_NATIONKEY > 18``
+
+.. code-block:: Java
+
+    import com.google.protobuf.InvalidProtocolBufferException;
+    import io.substrait.extension.ExtensionCollector;
+    import io.substrait.proto.Expression;
+    import io.substrait.proto.ExpressionReference;
+    import io.substrait.proto.ExtendedExpression;
+    import io.substrait.proto.FunctionArgument;
+    import io.substrait.proto.SimpleExtensionDeclaration;
+    import io.substrait.proto.SimpleExtensionURI;
+    import io.substrait.type.NamedStruct;
+    import io.substrait.type.Type;
+    import io.substrait.type.TypeCreator;
+    import io.substrait.type.proto.TypeProtoConverter;
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+
+    import java.nio.ByteBuffer;
+    import java.util.ArrayList;
+    import java.util.Arrays;
+    import java.util.Base64;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Optional;
+
+    public class ClientSubstraitExtendedExpressions {
+        public static void main(String[] args) throws Exception {
+            // create extended expression for: project two new columns + one filter
+            ByteBuffer binaryExtendedExpressions = createExtendedExpresionMessageUsingPOJOClasses();

Review Comment:
   changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] conbench-apache-arrow[bot] commented on pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #35570:
URL: https://github.com/apache/arrow/pull/35570#issuecomment-1728555980

   After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 00481a2799420f8f00ca7fc137769c1c99186977.
   
   There were no benchmark performance regressions. 🎉
   
   The [full Conbench report](https://github.com/apache/arrow/runs/16984163362) has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1307632606


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +83,8 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public ByteBuffer getSubstraitExtendedExpression() {

Review Comment:
   Added



##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +83,8 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public ByteBuffer getSubstraitExtendedExpression() {

Review Comment:
   Thank you, added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1307689559


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +83,8 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public ByteBuffer getSubstraitExtendedExpression() {

Review Comment:
   Please take another look at this. I'm not going to review this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zinking commented on pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "zinking (via GitHub)" <gi...@apache.org>.
zinking commented on PR #35570:
URL: https://github.com/apache/arrow/pull/35570#issuecomment-1657441660

   I guess people are okay with current projection usage and not seeking for subtrait integration for that.  
   for me, I am looking for a method to pass my java filter down to the native scanner.  and at this stage only the simplest filter expressions, not the ones with function calls etc ( I guess these can be followed up separately). 
   
   in a sense https://github.com/apache/arrow/pull/14287/files satisfies what I wanted but it is active and closed. 
   I'm generally good with using subtrait in the implementation, but I'd suggest let's keep the java interface simple.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35570:
URL: https://github.com/apache/arrow/pull/35570#issuecomment-1568866085

   :warning: GitHub issue #34252 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1218432494


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +206,132 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testDeserializeExtendedExpressions() {
+    // Extended Expression 01 (`add` `2` to column `id`): id + 2
+    // Extended Expression 02 (`concatenate` column `name` || column `name`): name || name
+    // Extended Expression 03 (`filter` 'id' < 20): id < 20
+    // Extended expression result: [add_two_to_column_a, add(FieldPath(0), 2),
+    // concat_column_a_and_b, binary_join_element_wise(FieldPath(1), FieldPath(1), ""),
+    // filter_one, (FieldPath(0) < 20)]
+    String binaryExtendedExpressions = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwKHggCEhovZnVuY3Rpb25zX2NvbXBhcmlz" +
+        "b24ueWFtbBIRGg8IARoLYWRkOmkzMl9pMzISFBoSCAIQARoMY29uY2F0OnZjaGFyEhIaEAgCEAIaCmx0OmFueV9hbnkaMQoaGhgaBCoCEAE" +
+        "iCBoGEgQKAhIAIgYaBAoCKAIaE2FkZF90d29fdG9fY29sdW1uX2EaOwoiGiAIARoEYgIQASIKGggSBgoEEgIIASIKGggSBgoEEgIIARoVY2" +
+        "9uY2F0X2NvbHVtbl9hX2FuZF9iGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzIwIhoKA" +
+        "klECgROQU1FEg4KBCoCEAEKBGICEAEYAg==";
+    // get binary plan
+    byte[] expression = Base64.getDecoder().decode(binaryExtendedExpressions);
+    ByteBuffer substraitExpression = ByteBuffer.allocateDirect(expression.length);
+    substraitExpression.put(expression);
+    // deserialize extended expression
+    List<String> extededExpressionList =
+        new AceroSubstraitConsumer(rootAllocator()).runDeserializeExpressions(substraitExpression);
+    assertEquals(3, extededExpressionList.size() / 2);
+    assertEquals("add_two_to_column_a", extededExpressionList.get(0));
+    assertEquals("add(FieldPath(0), 2)", extededExpressionList.get(1));
+    assertEquals("concat_column_a_and_b", extededExpressionList.get(2));
+    assertEquals("binary_join_element_wise(FieldPath(1), FieldPath(1), \"\")", extededExpressionList.get(3));
+    assertEquals("filter_id_lower_than_20", extededExpressionList.get(4));
+    assertEquals("(FieldPath(0) < 20)", extededExpressionList.get(5));
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProjectAndFilter() throws Exception {
+    // Extended Expression 01 (`add` `2` to column `id`): id + 2
+    // Extended Expression 02 (`concatenate` column `name` || column `name`): name || name
+    // Extended Expression 03 (`filter` 'id' < 20): id < 20
+    // Extended expression result: [add_two_to_column_a, add(FieldPath(0), 2),
+    // concat_column_a_and_b, binary_join_element_wise(FieldPath(1), FieldPath(1), ""),
+    // filter_one, (FieldPath(0) < 20)]
+    // Base64.getEncoder().encodeToString(plan.toByteArray()): Generated throughout Substrait POJO Extended Expressions
+    String binaryExtendedExpressions = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwKHggCEhovZnVuY3Rpb25zX2NvbXBhcmlz" +
+        "b24ueWFtbBIRGg8IARoLYWRkOmkzMl9pMzISFBoSCAIQARoMY29uY2F0OnZjaGFyEhIaEAgCEAIaCmx0OmFueV9hbnkaMQoaGhgaBCoCEAE" +
+        "iCBoGEgQKAhIAIgYaBAoCKAIaE2FkZF90d29fdG9fY29sdW1uX2EaOwoiGiAIARoEYgIQASIKGggSBgoEEgIIASIKGggSBgoEEgIIARoVY2" +
+        "9uY2F0X2NvbHVtbl9hX2FuZF9iGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzIwIhoKA" +
+        "klECgROQU1FEg4KBCoCEAEKBGICEAEYAg==";
+    Map<String, String> metadataSchema = new HashMap<>();
+    metadataSchema.put("parquet.avro.schema", "{\"type\":\"record\",\"name\":\"Users\"," +
+        "\"namespace\":\"org.apache.arrow.dataset\",\"fields\":[{\"name\":\"id\"," +
+        "\"type\":[\"int\",\"null\"]},{\"name\":\"name\",\"type\":[\"string\",\"null\"]}]}");
+    metadataSchema.put("writer.model.name", "avro");

Review Comment:
   I just discovered this, all the Dataset response message attach this schema metadata in their response messages. This was not detected because only Fields or Data was compared but if all schema is needed to compare we nee to add this metadata to the expected messages.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1307634778


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -27,6 +28,7 @@
 public class ScanOptions {
   private final Optional<String[]> columns;
   private final long batchSize;
+  private ByteBuffer substraitExtendedExpression;

Review Comment:
   This is optional, is up to the user to configure this or not, it is part of the builder option



##########
java/dataset/src/main/cpp/jni_wrapper.cc:
##########
@@ -484,6 +493,56 @@ JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createScann
   JNI_METHOD_END(-1L)
 }
 
+/*
+ * Class:     org_apache_arrow_dataset_jni_JniWrapper
+ * Method:    createSubstraitScanner
+ * Signature: (JLjava/nio/ByteBuffer;JJ)J
+ */
+JNIEXPORT jlong JNICALL Java_org_apache_arrow_dataset_jni_JniWrapper_createSubstraitScanner(
+    JNIEnv* env, jobject, jlong dataset_id, jobject substrait_expr_produce_or_filter, jlong batch_size,
+    jlong memory_pool_id) {
+  JNI_METHOD_START
+  arrow::MemoryPool* pool = reinterpret_cast<arrow::MemoryPool*>(memory_pool_id);
+  if (pool == nullptr) {
+    JniThrow("Memory pool does not exist or has been closed");
+  }
+  std::shared_ptr<arrow::dataset::Dataset> dataset =
+      RetrieveNativeInstance<arrow::dataset::Dataset>(dataset_id);
+  std::shared_ptr<arrow::dataset::ScannerBuilder> scanner_builder =
+      JniGetOrThrow(dataset->NewScan());
+  JniAssertOkOrThrow(scanner_builder->Pool(pool));
+  if (substrait_expr_produce_or_filter != nullptr) {

Review Comment:
   changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1307635080


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -58,6 +60,18 @@ public ScanOptions(long batchSize, Optional<String[]> columns) {
     this.columns = columns;
   }
 
+  /**
+   * Constructor.
+   * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+   * @param substraitExtendedExpression Extended expression to evaluate for project new columns or apply filter.
+   */
+  public ScanOptions(long batchSize, ByteBuffer substraitExtendedExpression) {

Review Comment:
   added



##########
java/dataset/src/main/java/org/apache/arrow/dataset/source/Dataset.java:
##########
@@ -32,4 +32,14 @@ public interface Dataset extends AutoCloseable {
    * @return the Scanner instance
    */
   Scanner newScan(ScanOptions options);
+
+  /**
+   * Create a new Scanner, using the provided options,
+   * that contains the binary representation of the Substrait
+   * Extended Expression.
+   *
+   * @param options options used during creating Scanner
+   * @return the Scanner instance
+   */
+  Scanner newSubstraitScan(ScanOptions options);

Review Comment:
   changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1318014346


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,167 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test(expected = RuntimeException.class)
+  public void testBaseParquetReadWithExtendedExpressionsFilterException() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    // Expression 02: WHERE ID < 10
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW5" +
+        "5X2FueRISGhAIAhACGgpsdDphbnlfYW55GjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl9" +
+        "0aGFuXzIwGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKAoaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzEwIhoKAklECgROQU1F" +
+        "Eg4KBCoCEAEKBGICEAEYAg==";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProject() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("add_two_to_column_a", new ArrowType.Int(32, true)),
+        Field.nullable("concat_column_a_and_b", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Project New Column:
+    // Expression ADD: id + 2
+    // Expression CONCAT: name + '-' + name
+    String binarySubstraitExpressionProject = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" +
+        "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" +
+        "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" +
+        "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionProject = Base64.getDecoder().decode(binarySubstraitExpressionProject);
+    ByteBuffer substraitExpressionProject = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionProject.length);
+    substraitExpressionProject.put(arrayByteSubstraitExpressionProject);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+         .substraitExpressionProjection(substraitExpressionProject)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(5, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProjectAndFilter() throws Exception {

Review Comment:
   This is due to the fact that ArrowReader references like DatasetFactory/Dataset/Scanner must be alive, so if we put all of them in try-catch, ArrowReader will be invalidated. We can create instance variables, but this adds more confusion in declare, assign values, and release these resources variables properly.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1318987546


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,167 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test(expected = RuntimeException.class)
+  public void testBaseParquetReadWithExtendedExpressionsFilterException() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    // Expression 02: WHERE ID < 10
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW5" +
+        "5X2FueRISGhAIAhACGgpsdDphbnlfYW55GjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl9" +
+        "0aGFuXzIwGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKAoaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzEwIhoKAklECgROQU1F" +
+        "Eg4KBCoCEAEKBGICEAEYAg==";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProject() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("add_two_to_column_a", new ArrowType.Int(32, true)),
+        Field.nullable("concat_column_a_and_b", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Project New Column:
+    // Expression ADD: id + 2
+    // Expression CONCAT: name + '-' + name
+    String binarySubstraitExpressionProject = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" +
+        "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" +
+        "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" +
+        "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionProject = Base64.getDecoder().decode(binarySubstraitExpressionProject);
+    ByteBuffer substraitExpressionProject = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionProject.length);
+    substraitExpressionProject.put(arrayByteSubstraitExpressionProject);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+         .substraitExpressionProjection(substraitExpressionProject)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(5, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProjectAndFilter() throws Exception {

Review Comment:
   Makes sense, let's keep the tests the way you have it. I agree that sounds like the better option.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1319110093


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,173 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitFilter(Optional.of(substraitExpressionFilter))
+        .substraitProjection(Optional.empty())
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+        assertTrue(reader.getVectorSchemaRoot().getVector("id").toString().equals("[19, 1, 11]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("name").toString()
+            .equals("[value_19, value_1, value_11]"));
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilterException() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    // Expression 02: WHERE ID < 10
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW5" +
+        "5X2FueRISGhAIAhACGgpsdDphbnlfYW55GjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl9" +
+        "0aGFuXzIwGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKAoaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzEwIhoKAklECgROQU1F" +
+        "Eg4KBCoCEAEKBGICEAEYAg==";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitFilter(Optional.of(substraitExpressionFilter))
+        .substraitProjection(Optional.empty())
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish()
+    ) {
+      Exception e = assertThrows(RuntimeException.class, () -> dataset.newScan(options));
+      assertTrue(e.getMessage().startsWith("Only one filter expression may be provided"));
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProject() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("add_two_to_column_a", new ArrowType.Int(32, true)),
+        Field.nullable("concat_column_a_and_b", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Project New Column:
+    // Expression ADD: id + 2
+    // Expression CONCAT: name + '-' + name
+    String binarySubstraitExpressionProject = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" +
+        "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" +
+        "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" +
+        "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI=";
+    ByteBuffer substraitExpressionProject = getByteBuffer(binarySubstraitExpressionProject);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitProjection(Optional.of(substraitExpressionProject))
+        .substraitFilter(Optional.empty())
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        assertTrue(reader.getVectorSchemaRoot().getVector("add_two_to_column_a").toString()
+            .equals("[21, 3, 13, 23, 47]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("concat_column_a_and_b").toString()
+            .equals("[value_19 - value_19, value_1 - value_1, value_11 - value_11, " +
+                "value_21 - value_21, value_45 - value_45]"));
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(5, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProjectAndFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("add_two_to_column_a", new ArrowType.Int(32, true)),
+        Field.nullable("concat_column_a_and_b", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Project New Column:
+    // Expression ADD: id + 2
+    // Expression CONCAT: name + '-' + name
+    String binarySubstraitExpressionProject = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" +
+        "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" +
+        "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" +
+        "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI=";
+    ByteBuffer substraitExpressionProject = getByteBuffer(binarySubstraitExpressionProject);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitProjection(Optional.of(substraitExpressionProject))
+        .substraitFilter(Optional.of(substraitExpressionFilter))
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        assertTrue(reader.getVectorSchemaRoot().getVector("add_two_to_column_a").toString()
+            .equals("[21, 3, 13]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("concat_column_a_and_b").toString()
+            .equals("[value_19 - value_19, value_1 - value_1, value_11 - value_11]"));
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  private static ByteBuffer getByteBuffer(String base64EncodedSubstrait) {
+    byte[] substraitFilter = Base64.getDecoder().decode(base64EncodedSubstrait);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(substraitFilter.length);

Review Comment:
   Don't forget about the rest of this function! e.g. `substraitExpressionFilter` and `substraitFilter`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1317322786


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,167 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test(expected = RuntimeException.class)
+  public void testBaseParquetReadWithExtendedExpressionsFilterException() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    // Expression 02: WHERE ID < 10
+    String binarySubstraitExpressionFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW5" +
+        "5X2FueRISGhAIAhACGgpsdDphbnlfYW55GjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKBQaF2ZpbHRlcl9pZF9sb3dlcl9" +
+        "0aGFuXzIwGjcKHBoaCAIaBAoCEAEiCBoGEgQKAhIAIgYaBAoCKAoaF2ZpbHRlcl9pZF9sb3dlcl90aGFuXzEwIhoKAklECgROQU1F" +
+        "Eg4KBCoCEAEKBGICEAEYAg==";
+    byte[] arrayByteSubstraitExpressionFilter = Base64.getDecoder().decode(binarySubstraitExpressionFilter);
+    ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionFilter.length);
+    substraitExpressionFilter.put(arrayByteSubstraitExpressionFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitExpressionFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(3, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProject() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("add_two_to_column_a", new ArrowType.Int(32, true)),
+        Field.nullable("concat_column_a_and_b", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Project New Column:
+    // Expression ADD: id + 2
+    // Expression CONCAT: name + '-' + name
+    String binarySubstraitExpressionProject = "Ch4IARIaL2Z1bmN0aW9uc19hcml0aG1ldGljLnlhbWwSERoPCAEaC2FkZDppM" +
+        "zJfaTMyEhQaEggCEAEaDGNvbmNhdDp2Y2hhchoxChoaGBoEKgIQASIIGgYSBAoCEgAiBhoECgIoAhoTYWRkX3R3b190b19jb2x1" +
+        "bW5fYRpGCi0aKwgBGgRiAhABIgoaCBIGCgQSAggBIgkaBwoFYgMgLSAiChoIEgYKBBICCAEaFWNvbmNhdF9jb2x1bW5fYV9hbmR" +
+        "fYiIaCgJJRAoETkFNRRIOCgQqAhABCgRiAhABGAI=";
+    byte[] arrayByteSubstraitExpressionProject = Base64.getDecoder().decode(binarySubstraitExpressionProject);
+    ByteBuffer substraitExpressionProject = ByteBuffer.allocateDirect(arrayByteSubstraitExpressionProject.length);
+    substraitExpressionProject.put(arrayByteSubstraitExpressionProject);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+         .substraitExpressionProjection(substraitExpressionProject)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+      }
+      assertEquals(5, rowcount);
+    }
+  }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsProjectAndFilter() throws Exception {

Review Comment:
   I see you added a helper function for base64 encoded filters. I was actually thinking of putting most of the test code in this helper function though like this (pseudo code):
   
   ```
   private ArrowReader scanParquetFileUsingSubstrait(Optional<String> base64EncodedSubstraitFilter, Optional<String> base64EncodedSubstraitProjection) throws Exception {
       final Schema schema = new Schema(Arrays.asList(
           Field.nullable("id", new ArrowType.Int(32, true)),
           Field.nullable("name", new ArrowType.Utf8())
       ), null);
       ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
       ByteBuffer substraitExpressionProjection = getByteBuffer(base64EncodedSubstraitProjection);
       ParquetWriteSupport writeSupport = ParquetWriteSupport
           .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
               11, "value_11", 21, "value_21", 45, "value_45");
       ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
           .columns(Optional.empty())
           .substraitFilter(substraitExpressionFilter)
           .substraitProjection(substraitExpressionProjection)
           .build();
       try (
           DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
               FileFormat.PARQUET, writeSupport.getOutputURI());
           Dataset dataset = datasetFactory.finish();
           Scanner scanner = dataset.newScan(options);
           ArrowReader reader = scanner.scanBatches()
       ) {
         assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
         return reader;
       }
     return null; // or raise 
   }
   
   
    @Test
     public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
       // Substrait Extended Expression: Filter:
       // Expression 01: WHERE ID < 20
       String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
           "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
           "BCgRiAhABGAI=";
       try (
           ArrowReader reader = scanParquetFileUsingSubstrait(base64EncodedSubstraitFilter, Optional.empty())
       ) {
         while (reader.loadNextBatch()) {
           assertTrue(reader.getVectorSchemaRoot().getVector("id").toString().equals("[19, 1, 11]"));
           assertTrue(reader.getVectorSchemaRoot().getVector("name").toString()
               .equals("[value_19, value_1, value_11]"));
         }
     }
   }
   
    @Test
   ...
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1312995270


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +73,106 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  private ByteBuffer getProjection() {
+    return projection;
+  }
+
+  private ByteBuffer getFilter() {
+    return filter;
+  }
+
+  private ByteBuffer getProjectionAndFilter() {
+    return projectionAndFilter;
+  }
+
+  /**
+   * To evaluate what option was used to define Substrait Extended Expression (Project/Filter).
+   *
+   * @return Substrait Extended Expression configured for project new columns and/or apply filter
+   */
+  public ByteBuffer getSubstraitExtendedExpression() {
+    if (getProjection() != null) {
+      return getProjection();
+    } else if (getFilter() != null) {
+      return getFilter();
+    } else if (getProjectionAndFilter() != null) {
+      return getProjectionAndFilter();
+    } else {
+      return null;
+    }
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private final Optional<String[]> columns;
+    private ByteBuffer projection;
+    private ByteBuffer filter;
+    private ByteBuffer projectionAndFilter;
+
+    /**
+     * Constructor.
+     * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+     * @param columns (Optional) Projected columns. {@link Optional#empty()} for scanning all columns. Otherwise,
+     *                Only columns present in the Array will be scanned.
+     */
+    public Builder(long batchSize, Optional<String[]> columns) {

Review Comment:
   Changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1323205000


##########
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java:
##########
@@ -69,4 +74,77 @@ public Optional<String[]> getColumns() {
   public long getBatchSize() {
     return batchSize;
   }
+
+  public Optional<ByteBuffer> getSubstraitProjection() {
+    return substraitProjection;
+  }
+
+  public Optional<ByteBuffer> getSubstraitFilter() {
+    return substraitFilter;
+  }
+
+  /**
+   * Builder for Options used during scanning.
+   */
+  public static class Builder {
+    private final long batchSize;
+    private Optional<String[]> columns;
+    private Optional<ByteBuffer> substraitProjection;
+    private Optional<ByteBuffer> substraitFilter;
+
+    /**
+     * Constructor.
+     * @param batchSize Maximum row number of each returned {@link org.apache.arrow.vector.ipc.message.ArrowRecordBatch}
+     */
+    public Builder(long batchSize) {
+      this.batchSize = batchSize;
+    }
+
+    /**
+     * Set the Projected columns. Empty for scanning all columns.
+     *
+     * @param columns Projected columns. Empty for scanning all columns.
+     * @return the ScanOptions configured.
+     */
+    public Builder columns(Optional<String[]> columns) {

Review Comment:
   The values for substraitProjection and substraitFilter have been changed.
   
   The rule definition for columns mentions that empty means scanning all columns, so that is how it works.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #35570: GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #35570:
URL: https://github.com/apache/arrow/pull/35570#discussion_r1326496961


##########
java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java:
##########
@@ -204,4 +205,170 @@ public void testRunBinaryQueryNamedTableNation() throws Exception {
       }
     }
   }
+
+  @Test
+  public void testBaseParquetReadWithExtendedExpressionsFilter() throws Exception {
+    final Schema schema = new Schema(Arrays.asList(
+        Field.nullable("id", new ArrowType.Int(32, true)),
+        Field.nullable("name", new ArrowType.Utf8())
+    ), null);
+    // Substrait Extended Expression: Filter:
+    // Expression 01: WHERE ID < 20
+    String base64EncodedSubstraitFilter = "Ch4IARIaL2Z1bmN0aW9uc19jb21wYXJpc29uLnlhbWwSEhoQCAIQAhoKbHQ6YW55X2F" +
+        "ueRo3ChwaGggCGgQKAhABIggaBhIECgISACIGGgQKAigUGhdmaWx0ZXJfaWRfbG93ZXJfdGhhbl8yMCIaCgJJRAoETkFNRRIOCgQqAhA" +
+        "BCgRiAhABGAI=";
+    ByteBuffer substraitExpressionFilter = getByteBuffer(base64EncodedSubstraitFilter);
+    ParquetWriteSupport writeSupport = ParquetWriteSupport
+        .writeTempFile(AVRO_SCHEMA_USER, TMP.newFolder(), 19, "value_19", 1, "value_1",
+            11, "value_11", 21, "value_21", 45, "value_45");
+    ScanOptions options = new ScanOptions.Builder(/*batchSize*/ 32768)
+        .columns(Optional.empty())
+        .substraitFilter(substraitExpressionFilter)
+        .build();
+    try (
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
+            FileFormat.PARQUET, writeSupport.getOutputURI());
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+      assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
+      int rowcount = 0;
+      while (reader.loadNextBatch()) {
+        rowcount += reader.getVectorSchemaRoot().getRowCount();
+        assertTrue(reader.getVectorSchemaRoot().getVector("id").toString().equals("[19, 1, 11]"));
+        assertTrue(reader.getVectorSchemaRoot().getVector("name").toString()
+            .equals("[value_19, value_1, value_11]"));

Review Comment:
   Just filled https://github.com/apache/arrow/issues/37728



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org