Posted to github@arrow.apache.org by "davisusanibar (via GitHub)" <gi...@apache.org> on 2023/06/02 00:04:41 UTC

[GitHub] [arrow-cookbook] davisusanibar opened a new pull request, #310: GH-309: [Java] Initial Substrait Plan documentation (Query Dataset)

davisusanibar opened a new pull request, #310:
URL: https://github.com/apache/arrow-cookbook/pull/310

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-cookbook] lidavidm commented on a diff in pull request #310: GH-309: [Java] Initial Substrait Plan documentation (Query Dataset)

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.

lidavidm commented on code in PR #310:
URL: https://github.com/apache/arrow-cookbook/pull/310#discussion_r1214305271


##########
java/source/substrait.rst:
##########
@@ -0,0 +1,211 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _arrow-substrait:
+
+=========
+Substrait
+=========
+
+Arrow Java is using `Substrait`_ to leverage their integrations using standard
+specification to share messages between different layer y/o languages.

Review Comment:
   ```suggestion
   Arrow can use `Substrait`_ to integrate with other languages.
   ```



##########
java/source/substrait.rst:
##########
@@ -0,0 +1,211 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _arrow-substrait:
+
+=========
+Substrait
+=========
+
+Arrow Java is using `Substrait`_ to leverage their integrations using standard
+specification to share messages between different layer y/o languages.
+
+.. contents::
+
+Query Datasets
+==============
+
+Arrow :doc:`Java Dataset <dataset>` offer capabilities to read tabular data.
+For other side `Substrait Java`_ offer serialization Plan for Relational Algebra.
+Arrow Java Substrait is combined both of them to enable Querying data using
+`Acero`_ as a backend.
+
+Current `Acero`_ supported operations are:
+- Read
+- Filter
+- Project
+- Join
+- Aggregate

Review Comment:
   ```suggestion
   The Substrait support in Arrow combines :doc:`Dataset <dataset>` and
   `substrait-java`_ to query datasets using `Acero`_ as a backend.
   
   Acero currently supports:
   
   - Reading Arrow, CSV, ORC, and Parquet files
   - Filters
   - Projections
   - Joins
   - Aggregates
   ```



##########
java/source/substrait.rst:
##########
@@ -0,0 +1,211 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _arrow-substrait:
+
+=========
+Substrait
+=========
+
+Arrow Java is using `Substrait`_ to leverage their integrations using standard
+specification to share messages between different layer y/o languages.
+
+.. contents::
+
+Query Datasets
+==============

Review Comment:
   ```suggestion
   Querying Datasets
   =================
   ```



##########
java/source/substrait.rst:
##########
@@ -0,0 +1,211 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _arrow-substrait:
+
+=========
+Substrait
+=========
+
+Arrow Java is using `Substrait`_ to leverage their integrations using standard
+specification to share messages between different layer y/o languages.
+
+.. contents::
+
+Query Datasets
+==============
+
+Arrow :doc:`Java Dataset <dataset>` offer capabilities to read tabular data.
+For other side `Substrait Java`_ offer serialization Plan for Relational Algebra.
+Arrow Java Substrait is combined both of them to enable Querying data using
+`Acero`_ as a backend.
+
+Current `Acero`_ supported operations are:
+- Read
+- Filter
+- Project
+- Join
+- Aggregate
+
+Here is an example of a Java program that queries a Parquet file:
+
+.. testcode::
+
+    import com.google.common.collect.ImmutableList;
+    import io.substrait.isthmus.SqlToSubstrait;
+    import io.substrait.proto.Plan;
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.dataset.substrait.AceroSubstraitConsumer;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+    import org.apache.calcite.sql.parser.SqlParseException;
+
+    import java.nio.ByteBuffer;
+    import java.util.HashMap;
+    import java.util.Map;
+
+    static Plan queryTableNation() throws SqlParseException {
+       String sql = "SELECT * FROM NATION WHERE N_NATIONKEY = 17";
+       String nation = "CREATE TABLE NATION (N_NATIONKEY BIGINT NOT NULL, N_NAME CHAR(25), " +
+               "N_REGIONKEY BIGINT NOT NULL, N_COMMENT VARCHAR(152))";
+       SqlToSubstrait sqlToSubstrait = new SqlToSubstrait();
+       Plan plan = sqlToSubstrait.execute(sql, ImmutableList.of(nation));
+       return plan;
+    }
+
+    static void queryDatasetThruSubstraitPlanDefinition() {
+       String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/tpch/nation.parquet";
+       ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
+       try (
+           BufferAllocator allocator = new RootAllocator();
+           DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+                   FileFormat.PARQUET, uri);
+           Dataset dataset = datasetFactory.finish();
+           Scanner scanner = dataset.newScan(options);
+           ArrowReader reader = scanner.scanBatches()
+       ) {
+           // map table to reader
+           Map<String, ArrowReader> mapTableToArrowReader = new HashMap<>();
+           mapTableToArrowReader.put("NATION", reader);
+           // get binary plan
+           Plan plan = queryTableNation();
+           ByteBuffer substraitPlan = ByteBuffer.allocateDirect(plan.toByteArray().length);
+           substraitPlan.put(plan.toByteArray());
+           // run query
+           try (ArrowReader arrowReader = new AceroSubstraitConsumer(allocator).runQuery(
+               substraitPlan,
+               mapTableToArrowReader
+           )) {
+               while (arrowReader.loadNextBatch()) {
+                   System.out.print(arrowReader.getVectorSchemaRoot().contentToTSVString());
+               }
+           }
+       } catch (Exception e) {
+           e.printStackTrace();
+       }
+    }
+
+    queryDatasetThruSubstraitPlanDefinition();
+
+.. testoutput::
+
+    N_NATIONKEY    N_NAME    N_REGIONKEY    N_COMMENT
+    17    PERU    1    platelets. blithely pending dependencies use fluffily across the even pinto beans. carefully silent accoun
+
+It is also possible to query multiple datasets and joining then based on some criteria.
+Let's query for example the following datasets: TPCH Nation and TPCH Customer
+
+.. testcode::
+
+    import com.google.common.collect.ImmutableList;
+    import io.substrait.isthmus.SqlToSubstrait;
+    import io.substrait.proto.Plan;
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.dataset.substrait.AceroSubstraitConsumer;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+    import org.apache.calcite.sql.parser.SqlParseException;
+
+    import java.nio.ByteBuffer;
+    import java.util.HashMap;
+    import java.util.Map;
+
+    static Plan queryTableNationJoinCustomer() throws SqlParseException {
+        String sql = "SELECT n.n_name, COUNT(*) AS NUMBER_CUSTOMER FROM NATION n JOIN CUSTOMER c " +
+            "ON n.n_nationkey = c.c_nationkey WHERE n.n_nationkey = 17 " +
+            "GROUP BY n.n_name";
+        String nation = "CREATE TABLE NATION (N_NATIONKEY BIGINT NOT NULL, " +
+            "N_NAME CHAR(25), N_REGIONKEY BIGINT NOT NULL, N_COMMENT VARCHAR(152))";
+        String customer = "CREATE TABLE CUSTOMER (C_CUSTKEY BIGINT NOT NULL, " +
+            "C_NAME VARCHAR(25), C_ADDRESS VARCHAR(40), C_NATIONKEY BIGINT NOT NULL, " +
+            "C_PHONE CHAR(15), C_ACCTBAL DECIMAL, C_MKTSEGMENT CHAR(10), " +
+            "C_COMMENT VARCHAR(117) )";
+        SqlToSubstrait sqlToSubstrait = new SqlToSubstrait();
+        Plan plan = sqlToSubstrait.execute(sql,
+            ImmutableList.of(nation, customer));
+        return plan;
+    }
+
+    static void queryTwoDatasetsThruSubstraitPlanDefinition() {
+        String uriNation = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/tpch/nation.parquet";
+        String uriCustomer = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/tpch/customer.parquet";
+        ScanOptions optionsNations = new ScanOptions(/*batchSize*/ 32768);

Review Comment:
   Just use the same options for both?



##########
java/source/substrait.rst:
##########
@@ -0,0 +1,211 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _arrow-substrait:
+
+=========
+Substrait
+=========
+
+Arrow Java is using `Substrait`_ to leverage their integrations using standard
+specification to share messages between different layer y/o languages.
+
+.. contents::
+
+Query Datasets
+==============
+
+Arrow :doc:`Java Dataset <dataset>` offer capabilities to read tabular data.
+For other side `Substrait Java`_ offer serialization Plan for Relational Algebra.
+Arrow Java Substrait is combined both of them to enable Querying data using
+`Acero`_ as a backend.
+
+Current `Acero`_ supported operations are:
+- Read
+- Filter
+- Project
+- Join
+- Aggregate
+
+Here is an example of a Java program that queries a Parquet file:
+
+.. testcode::
+
+    import com.google.common.collect.ImmutableList;
+    import io.substrait.isthmus.SqlToSubstrait;
+    import io.substrait.proto.Plan;
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.dataset.substrait.AceroSubstraitConsumer;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+    import org.apache.calcite.sql.parser.SqlParseException;
+
+    import java.nio.ByteBuffer;
+    import java.util.HashMap;
+    import java.util.Map;
+
+    static Plan queryTableNation() throws SqlParseException {
+       String sql = "SELECT * FROM NATION WHERE N_NATIONKEY = 17";
+       String nation = "CREATE TABLE NATION (N_NATIONKEY BIGINT NOT NULL, N_NAME CHAR(25), " +
+               "N_REGIONKEY BIGINT NOT NULL, N_COMMENT VARCHAR(152))";
+       SqlToSubstrait sqlToSubstrait = new SqlToSubstrait();
+       Plan plan = sqlToSubstrait.execute(sql, ImmutableList.of(nation));
+       return plan;
+    }
+
+    static void queryDatasetThruSubstraitPlanDefinition() {
+       String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/tpch/nation.parquet";
+       ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
+       try (
+           BufferAllocator allocator = new RootAllocator();
+           DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+                   FileFormat.PARQUET, uri);
+           Dataset dataset = datasetFactory.finish();
+           Scanner scanner = dataset.newScan(options);
+           ArrowReader reader = scanner.scanBatches()
+       ) {
+           // map table to reader

Review Comment:
   ```suggestion
   ```



##########
java/source/substrait.rst:
##########
@@ -0,0 +1,211 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _arrow-substrait:
+
+=========
+Substrait
+=========
+
+Arrow Java is using `Substrait`_ to leverage their integrations using standard
+specification to share messages between different layer y/o languages.
+
+.. contents::
+
+Query Datasets
+==============
+
+Arrow :doc:`Java Dataset <dataset>` offer capabilities to read tabular data.
+For other side `Substrait Java`_ offer serialization Plan for Relational Algebra.
+Arrow Java Substrait is combined both of them to enable Querying data using
+`Acero`_ as a backend.
+
+Current `Acero`_ supported operations are:
+- Read
+- Filter
+- Project
+- Join
+- Aggregate
+
+Here is an example of a Java program that queries a Parquet file:
+
+.. testcode::
+
+    import com.google.common.collect.ImmutableList;
+    import io.substrait.isthmus.SqlToSubstrait;
+    import io.substrait.proto.Plan;
+    import org.apache.arrow.dataset.file.FileFormat;
+    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+    import org.apache.arrow.dataset.jni.NativeMemoryPool;
+    import org.apache.arrow.dataset.scanner.ScanOptions;
+    import org.apache.arrow.dataset.scanner.Scanner;
+    import org.apache.arrow.dataset.source.Dataset;
+    import org.apache.arrow.dataset.source.DatasetFactory;
+    import org.apache.arrow.dataset.substrait.AceroSubstraitConsumer;
+    import org.apache.arrow.memory.BufferAllocator;
+    import org.apache.arrow.memory.RootAllocator;
+    import org.apache.arrow.vector.ipc.ArrowReader;
+    import org.apache.calcite.sql.parser.SqlParseException;
+
+    import java.nio.ByteBuffer;
+    import java.util.HashMap;
+    import java.util.Map;
+
+    static Plan queryTableNation() throws SqlParseException {
+       String sql = "SELECT * FROM NATION WHERE N_NATIONKEY = 17";
+       String nation = "CREATE TABLE NATION (N_NATIONKEY BIGINT NOT NULL, N_NAME CHAR(25), " +
+               "N_REGIONKEY BIGINT NOT NULL, N_COMMENT VARCHAR(152))";
+       SqlToSubstrait sqlToSubstrait = new SqlToSubstrait();
+       Plan plan = sqlToSubstrait.execute(sql, ImmutableList.of(nation));
+       return plan;
+    }
+
+    static void queryDatasetThruSubstraitPlanDefinition() {
+       String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/tpch/nation.parquet";
+       ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
+       try (
+           BufferAllocator allocator = new RootAllocator();
+           DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+                   FileFormat.PARQUET, uri);
+           Dataset dataset = datasetFactory.finish();
+           Scanner scanner = dataset.newScan(options);
+           ArrowReader reader = scanner.scanBatches()
+       ) {
+           // map table to reader
+           Map<String, ArrowReader> mapTableToArrowReader = new HashMap<>();
+           mapTableToArrowReader.put("NATION", reader);
+           // get binary plan
+           Plan plan = queryTableNation();
+           ByteBuffer substraitPlan = ByteBuffer.allocateDirect(plan.toByteArray().length);
+           substraitPlan.put(plan.toByteArray());
+           // run query
+           try (ArrowReader arrowReader = new AceroSubstraitConsumer(allocator).runQuery(
+               substraitPlan,
+               mapTableToArrowReader
+           )) {
+               while (arrowReader.loadNextBatch()) {
+                   System.out.print(arrowReader.getVectorSchemaRoot().contentToTSVString());
+               }
+           }
+       } catch (Exception e) {
+           e.printStackTrace();
+       }
+    }
+
+    queryDatasetThruSubstraitPlanDefinition();
+
+.. testoutput::
+
+    N_NATIONKEY    N_NAME    N_REGIONKEY    N_COMMENT
+    17    PERU    1    platelets. blithely pending dependencies use fluffily across the even pinto beans. carefully silent accoun
+
+It is also possible to query multiple datasets and joining then based on some criteria.
+Let's query for example the following datasets: TPCH Nation and TPCH Customer

Review Comment:
   ```suggestion
   It is also possible to query multiple datasets and join them based on some criteria.
   For example, we can join the nation and customer tables from the TPC-H benchmark:
   ```



[GitHub] [arrow-cookbook] lidavidm merged pull request #310: GH-309: [Java] Initial Substrait Plan documentation (Query Dataset)

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.

lidavidm merged PR #310:
URL: https://github.com/apache/arrow-cookbook/pull/310

