Posted to commits@arrow.apache.org by li...@apache.org on 2022/02/22 13:02:02 UTC
[arrow-cookbook] branch main updated: [Java] Java cookbook for creating Arrow JNI dataset (#138)
This is an automated email from the ASF dual-hosted git repository.
lidavidm pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git
The following commit(s) were added to refs/heads/main by this push:
new 97eb69c [Java] Java cookbook for creating Arrow JNI dataset (#138)
97eb69c is described below
commit 97eb69c2c2d55f2f7c660ff0bb43d3795867f56b
Author: david dali susanibar arce <da...@gmail.com>
AuthorDate: Tue Feb 22 08:01:56 2022 -0500
[Java] Java cookbook for creating Arrow JNI dataset (#138)
* Adding java cookbook for creating arrow jni
* JNI library dependencies
* Testing problem with download dependencies
* Debug jni errors
* Solving jni errors for jni *.dylib and *.so library dependencies
* Custom protobuf.rb formula
* Adding parquet files
* Configure CI workflow for JNI and non-JNI cookbooks
* Adding github cache for protobuf lib
* Adding JNI testing cookbooks
* Arrow jni dataset for version 7.0.0
* Solving error: Failed to collect dependencies
* Update java/source/dataset.rst
Co-authored-by: David Li <li...@gmail.com>
* Solving PR comments
Co-authored-by: David Li <li...@gmail.com>
---
...a_cookbook.yml => test_java_linux_cookbook.yml} | 8 +-
...ava_cookbook.yml => test_java_osx_cookbook.yml} | 19 +-
java/ext/javadoctest.py | 1 -
java/source/dataset.rst | 277 +++++++++++++++++++++
java/source/demo/pom.xml | 20 +-
java/source/index.rst | 1 +
java/thirdpartydeps/parquetfiles/data1.parquet | Bin 0 -> 687 bytes
java/thirdpartydeps/parquetfiles/data2.parquet | Bin 0 -> 690 bytes
java/thirdpartydeps/parquetfiles/data3.parquet | Bin 0 -> 4569 bytes
9 files changed, 298 insertions(+), 28 deletions(-)
diff --git a/.github/workflows/test_java_cookbook.yml b/.github/workflows/test_java_linux_cookbook.yml
similarity index 89%
copy from .github/workflows/test_java_cookbook.yml
copy to .github/workflows/test_java_linux_cookbook.yml
index 8f211d5..539afd0 100644
--- a/.github/workflows/test_java_cookbook.yml
+++ b/.github/workflows/test_java_linux_cookbook.yml
@@ -15,7 +15,7 @@
# specific language governing permissions and limitations
# under the License.
-name: Test Java Cookbook
+name: Test Java Cookbook On Linux
on:
pull_request:
@@ -23,15 +23,15 @@ on:
- main
paths:
- "java/**"
- - ".github/workflows/test_java_cookbook.yml"
+ - ".github/workflows/test_java_linux_cookbook.yml"
concurrency:
group: ${{ github.repository }}-${{ github.ref }}-${{ github.workflow }}
cancel-in-progress: true
jobs:
- test_py:
- name: "Test Java Cookbook"
+ test_java_linux:
+ name: "Test Java Cookbook On Linux"
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v1
diff --git a/.github/workflows/test_java_cookbook.yml b/.github/workflows/test_java_osx_cookbook.yml
similarity index 75%
rename from .github/workflows/test_java_cookbook.yml
rename to .github/workflows/test_java_osx_cookbook.yml
index 8f211d5..a55dd68 100644
--- a/.github/workflows/test_java_cookbook.yml
+++ b/.github/workflows/test_java_osx_cookbook.yml
@@ -15,7 +15,7 @@
# specific language governing permissions and limitations
# under the License.
-name: Test Java Cookbook
+name: Test Java Cookbook on MacOS
on:
pull_request:
@@ -23,22 +23,25 @@ on:
- main
paths:
- "java/**"
- - ".github/workflows/test_java_cookbook.yml"
+ - ".github/workflows/test_java_osx_cookbook.yml"
concurrency:
group: ${{ github.repository }}-${{ github.ref }}-${{ github.workflow }}
cancel-in-progress: true
jobs:
- test_py:
- name: "Test Java Cookbook"
- runs-on: ubuntu-latest
+ test_java_osx:
+ name: "Test Java Cookbook on MacOS"
+ runs-on: macos-latest
steps:
- uses: actions/checkout@v1
- - name: Install dependencies
- run: sudo apt install libcurl4-openssl-dev libssl-dev python3-pip openjdk-11-jdk maven
+ - uses: actions/setup-java@v2
+ with:
+ distribution: 'temurin'
+ java-version: '11'
+ - name: Upgrade pip
+ run: python3 -m pip install --upgrade pip
- name: Run tests
run: make javatest
- name: Build cookbook
run: make java
-
diff --git a/java/ext/javadoctest.py b/java/ext/javadoctest.py
index 4b39817..1a55dd5 100644
--- a/java/ext/javadoctest.py
+++ b/java/ext/javadoctest.py
@@ -23,7 +23,6 @@ class JavaDocTestBuilder(DocTestBuilder):
) -> Any:
# go to project that contains all your arrow maven dependencies
path_arrow_project = pathlib.Path(__file__).parent.parent / "source" / "demo"
-
# create list of all arrow jar dependencies
subprocess.check_call(
[
diff --git a/java/source/dataset.rst b/java/source/dataset.rst
new file mode 100644
index 0000000..ecf2bb3
--- /dev/null
+++ b/java/source/dataset.rst
@@ -0,0 +1,277 @@
+.. _arrow-dataset:
+
+=======
+Dataset
+=======
+
+* `Arrow Java Dataset`_: Java implementation of the Arrow Datasets library. It implements the Dataset Java API via JNI bindings to the Arrow C++ Datasets library.
+
+.. contents::
+
+Constructing Datasets
+=====================
+
+We can construct a dataset with an auto-inferred schema.
+
+.. testcode::
+
+ import org.apache.arrow.dataset.file.FileFormat;
+ import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+ import org.apache.arrow.dataset.jni.NativeMemoryPool;
+ import org.apache.arrow.dataset.scanner.ScanOptions;
+ import org.apache.arrow.dataset.scanner.Scanner;
+ import org.apache.arrow.dataset.source.Dataset;
+ import org.apache.arrow.dataset.source.DatasetFactory;
+ import org.apache.arrow.memory.RootAllocator;
+ import java.util.stream.StreamSupport;
+
+ try (RootAllocator rootAllocator = new RootAllocator(Long.MAX_VALUE)) {
+ String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/parquetfiles/data1.parquet";
+ try (DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri)) {
+ try (Dataset dataset = datasetFactory.finish()) {
+ ScanOptions options = new ScanOptions(/*batchSize*/ 100);
+ try (Scanner scanner = dataset.newScan(options)) {
+ System.out.println(StreamSupport.stream(scanner.scan().spliterator(), false).count());
+ }
+ }
+ }
+ }
+
+.. testoutput::
+
+ 1
+
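+The ``1`` printed above counts the scan tasks yielded by ``scanner.scan()``, not the
+rows in the file. As a variation, we could sum the record batch lengths to count rows
+instead; a minimal, untested sketch using the same file and API as the recipe above:
+
+.. code-block:: java
+
+ import org.apache.arrow.dataset.file.FileFormat;
+ import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+ import org.apache.arrow.dataset.jni.NativeMemoryPool;
+ import org.apache.arrow.dataset.scanner.ScanOptions;
+ import org.apache.arrow.dataset.scanner.Scanner;
+ import org.apache.arrow.dataset.source.Dataset;
+ import org.apache.arrow.dataset.source.DatasetFactory;
+ import org.apache.arrow.memory.RootAllocator;
+
+ String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/parquetfiles/data1.parquet";
+ try (RootAllocator rootAllocator = new RootAllocator(Long.MAX_VALUE);
+ DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
+ Dataset dataset = datasetFactory.finish();
+ Scanner scanner = dataset.newScan(new ScanOptions(/*batchSize*/ 100))) {
+ final long[] totalRows = {0};
+ scanner.scan().forEach(scanTask ->
+ scanTask.execute().forEachRemaining(arrowRecordBatch -> {
+ // getLength() reports the number of rows in this record batch
+ totalRows[0] += arrowRecordBatch.getLength();
+ arrowRecordBatch.close();
+ }));
+ System.out.println(totalRows[0]); // prints 3 for data1.parquet
+ }
+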
+Let's construct our dataset with a predefined schema.
+
+.. testcode::
+
+ import org.apache.arrow.dataset.file.FileFormat;
+ import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+ import org.apache.arrow.dataset.jni.NativeMemoryPool;
+ import org.apache.arrow.dataset.scanner.ScanOptions;
+ import org.apache.arrow.dataset.scanner.Scanner;
+ import org.apache.arrow.dataset.source.Dataset;
+ import org.apache.arrow.dataset.source.DatasetFactory;
+ import org.apache.arrow.memory.RootAllocator;
+ import java.util.stream.StreamSupport;
+
+ String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/parquetfiles/data1.parquet";
+ try (RootAllocator rootAllocator = new RootAllocator(Long.MAX_VALUE)) {
+ try (DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri)) {
+ try (Dataset dataset = datasetFactory.finish(datasetFactory.inspect())) {
+ ScanOptions options = new ScanOptions(/*batchSize*/ 100);
+ try (Scanner scanner = dataset.newScan(options)) {
+ System.out.println(StreamSupport.stream(scanner.scan().spliterator(), false).count());
+ }
+ }
+ }
+ }
+
+.. testoutput::
+
+ 1
+
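+The difference from the previous recipe is the ``finish(Schema)`` overload:
+``finish()`` lets the factory infer the schema, while ``finish(datasetFactory.inspect())``
+constructs the dataset with an explicitly supplied schema (here, the inspected one).
+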
+Getting the Schema
+==================
+
+During Dataset Construction
+***************************
+
+.. testcode::
+
+ import org.apache.arrow.dataset.file.FileFormat;
+ import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+ import org.apache.arrow.dataset.jni.NativeMemoryPool;
+ import org.apache.arrow.dataset.source.DatasetFactory;
+ import org.apache.arrow.memory.RootAllocator;
+ import org.apache.arrow.vector.types.pojo.Schema;
+
+ String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/parquetfiles/data1.parquet";
+ try (RootAllocator rootAllocator = new RootAllocator(Long.MAX_VALUE)) {
+ try (DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri)) {
+ Schema schema = datasetFactory.inspect();
+
+ System.out.println(schema);
+ }
+ }
+
+.. testoutput::
+
+ Schema<id: Int(32, true), name: Utf8>(metadata: {parquet.avro.schema={"type":"record","name":"User","namespace":"org.apache.arrow.dataset","fields":[{"name":"id","type":["int","null"]},{"name":"name","type":["string","null"]}]}, writer.model.name=avro})
+
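+The extra metadata shows that these sample files were written with the Avro Parquet
+writer, which stores the original Avro schema under the ``parquet.avro.schema`` key.
+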
+From a Dataset
+**************
+
+.. testcode::
+
+ import org.apache.arrow.dataset.file.FileFormat;
+ import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+ import org.apache.arrow.dataset.jni.NativeMemoryPool;
+ import org.apache.arrow.dataset.scanner.ScanOptions;
+ import org.apache.arrow.dataset.scanner.Scanner;
+ import org.apache.arrow.dataset.source.Dataset;
+ import org.apache.arrow.dataset.source.DatasetFactory;
+ import org.apache.arrow.memory.RootAllocator;
+ import org.apache.arrow.vector.types.pojo.Schema;
+
+ String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/parquetfiles/data1.parquet";
+ try (RootAllocator rootAllocator = new RootAllocator(Long.MAX_VALUE)) {
+ try (DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri)) {
+ ScanOptions options = new ScanOptions(/*batchSize*/ 1);
+ try (Dataset dataset = datasetFactory.finish()) {
+ try (Scanner scanner = dataset.newScan(options)) {
+ Schema schema = scanner.schema();
+
+ System.out.println(schema);
+ }
+ }
+ }
+ }
+
+.. testoutput::
+
+ Schema<id: Int(32, true), name: Utf8>(metadata: {parquet.avro.schema={"type":"record","name":"User","namespace":"org.apache.arrow.dataset","fields":[{"name":"id","type":["int","null"]},{"name":"name","type":["string","null"]}]}, writer.model.name=avro})
+
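+Both recipes report the same schema for this file: ``inspect()`` obtains it from the
+data source before any scan, while ``scanner.schema()`` returns the schema the scan
+will actually produce, which also reflects any column projection (see the projection
+recipe below).
+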
+Query Parquet File
+==================
+
+Let's query information from a Parquet file.
+
+Query Data Content For File
+***************************
+
+.. testcode::
+
+ import org.apache.arrow.dataset.file.FileFormat;
+ import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+ import org.apache.arrow.dataset.jni.NativeMemoryPool;
+ import org.apache.arrow.dataset.scanner.ScanOptions;
+ import org.apache.arrow.dataset.scanner.Scanner;
+ import org.apache.arrow.dataset.source.Dataset;
+ import org.apache.arrow.dataset.source.DatasetFactory;
+ import org.apache.arrow.memory.RootAllocator;
+ import org.apache.arrow.vector.VectorLoader;
+ import org.apache.arrow.vector.VectorSchemaRoot;
+
+ String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/parquetfiles/data1.parquet";
+ try (RootAllocator rootAllocator = new RootAllocator(Long.MAX_VALUE);
+ DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
+ Dataset dataset = datasetFactory.finish()) {
+ ScanOptions options = new ScanOptions(/*batchSize*/ 100);
+ try (Scanner scanner = dataset.newScan(options);
+ VectorSchemaRoot vsr = VectorSchemaRoot.create(scanner.schema(), rootAllocator)) {
+ scanner.scan().forEach(scanTask -> {
+ VectorLoader loader = new VectorLoader(vsr);
+ scanTask.execute().forEachRemaining(arrowRecordBatch -> {
+ loader.load(arrowRecordBatch);
+ System.out.print(vsr.contentToTSVString());
+ arrowRecordBatch.close();
+ });
+ });
+ }
+ }
+
+.. testoutput::
+
+ id name
+ 1 David
+ 2 Gladis
+ 3 Juan
+
+Query Data Content For Directory
+********************************
+
+Consider that we have these files: data1 (3 rows), data2 (3 rows), and data3 (250 rows). With a batch size of 100, data1 and data2 each arrive as a single 3-row batch, while data3 is split into batches of 100, 100, and 50 rows.
+
+.. testcode::
+
+ import org.apache.arrow.dataset.file.FileFormat;
+ import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+ import org.apache.arrow.dataset.jni.NativeMemoryPool;
+ import org.apache.arrow.dataset.scanner.ScanOptions;
+ import org.apache.arrow.dataset.scanner.Scanner;
+ import org.apache.arrow.dataset.source.Dataset;
+ import org.apache.arrow.dataset.source.DatasetFactory;
+ import org.apache.arrow.memory.RootAllocator;
+ import org.apache.arrow.vector.VectorLoader;
+ import org.apache.arrow.vector.VectorSchemaRoot;
+
+ String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/parquetfiles/";
+ try (RootAllocator rootAllocator = new RootAllocator(Long.MAX_VALUE);
+ DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
+ Dataset dataset = datasetFactory.finish()) {
+ ScanOptions options = new ScanOptions(/*batchSize*/ 100);
+ try (Scanner scanner = dataset.newScan(options);
+ VectorSchemaRoot vsr = VectorSchemaRoot.create(scanner.schema(), rootAllocator)) {
+ scanner.scan().forEach(scanTask -> {
+ VectorLoader loader = new VectorLoader(vsr);
+ final int[] count = {1};
+ scanTask.execute().forEachRemaining(arrowRecordBatch -> {
+ loader.load(arrowRecordBatch);
+ System.out.println("Batch: " + count[0]++ + ", RowCount: " + vsr.getRowCount());
+ arrowRecordBatch.close();
+ });
+ });
+ }
+ }
+
+.. testoutput::
+
+ Batch: 1, RowCount: 3
+ Batch: 2, RowCount: 3
+ Batch: 3, RowCount: 100
+ Batch: 4, RowCount: 100
+ Batch: 5, RowCount: 50
+
+Query Data Content with Projection
+**********************************
+
+If we need to project only certain columns, we can configure ScanOptions with the projection needed; the output then contains only the projected columns.
+
+.. testcode::
+
+ import org.apache.arrow.dataset.file.FileFormat;
+ import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
+ import org.apache.arrow.dataset.jni.NativeMemoryPool;
+ import org.apache.arrow.dataset.scanner.ScanOptions;
+ import org.apache.arrow.dataset.scanner.Scanner;
+ import org.apache.arrow.dataset.source.Dataset;
+ import org.apache.arrow.dataset.source.DatasetFactory;
+ import org.apache.arrow.memory.RootAllocator;
+ import org.apache.arrow.vector.VectorLoader;
+ import org.apache.arrow.vector.VectorSchemaRoot;
+
+ import java.util.Optional;
+
+ String uri = "file:" + System.getProperty("user.dir") + "/thirdpartydeps/parquetfiles/data1.parquet";
+ try (RootAllocator rootAllocator = new RootAllocator(Long.MAX_VALUE);
+ DatasetFactory datasetFactory = new FileSystemDatasetFactory(rootAllocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
+ Dataset dataset = datasetFactory.finish()) {
+ String[] projection = new String[] {"name"};
+ ScanOptions options = new ScanOptions(/*batchSize*/ 100, Optional.of(projection));
+ try (Scanner scanner = dataset.newScan(options);
+ VectorSchemaRoot vsr = VectorSchemaRoot.create(scanner.schema(), rootAllocator)) {
+ scanner.scan().forEach(scanTask -> {
+ VectorLoader loader = new VectorLoader(vsr);
+ scanTask.execute().forEachRemaining(arrowRecordBatch -> {
+ loader.load(arrowRecordBatch);
+ System.out.print(vsr.contentToTSVString());
+ arrowRecordBatch.close();
+ });
+ });
+ }
+ }
+
+.. testoutput::
+
+ name
+ David
+ Gladis
+ Juan
+
+
+.. _Arrow Java Dataset: https://arrow.apache.org/docs/dev/java/dataset.html
\ No newline at end of file
diff --git a/java/source/demo/pom.xml b/java/source/demo/pom.xml
index 2f4305d..41d10d5 100644
--- a/java/source/demo/pom.xml
+++ b/java/source/demo/pom.xml
@@ -21,7 +21,7 @@
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
- <arrow.version>6.0.0</arrow.version>
+ <arrow.version>7.0.0</arrow.version>
</properties>
<dependencies>
@@ -42,23 +42,13 @@
</dependency>
<dependency>
<groupId>org.apache.arrow</groupId>
- <artifactId>flight-core</artifactId>
+ <artifactId>arrow-dataset</artifactId>
<version>${arrow.version}</version>
- <exclusions>
- <exclusion>
- <groupId>io.netty</groupId>
- <artifactId>netty-transport-native-unix-common</artifactId>
- </exclusion>
- <exclusion>
- <groupId>io.netty</groupId>
- <artifactId>netty-transport-native-kqueue</artifactId>
- </exclusion>
- </exclusions>
</dependency>
<dependency>
- <groupId>junit</groupId>
- <artifactId>junit</artifactId>
- <version>4.13.2</version>
+ <groupId>com.google.guava</groupId>
+ <artifactId>guava</artifactId>
+ <version>30.1.1-jre</version>
</dependency>
</dependencies>
diff --git a/java/source/index.rst b/java/source/index.rst
index 17b87c3..38d7bf7 100644
--- a/java/source/index.rst
+++ b/java/source/index.rst
@@ -14,6 +14,7 @@ Welcome to java arrow's documentation!
io
schema
data
+ dataset
Indices and tables
==================
diff --git a/java/thirdpartydeps/parquetfiles/data1.parquet b/java/thirdpartydeps/parquetfiles/data1.parquet
new file mode 100644
index 0000000..a2602db
Binary files /dev/null and b/java/thirdpartydeps/parquetfiles/data1.parquet differ
diff --git a/java/thirdpartydeps/parquetfiles/data2.parquet b/java/thirdpartydeps/parquetfiles/data2.parquet
new file mode 100644
index 0000000..0adc5eb
Binary files /dev/null and b/java/thirdpartydeps/parquetfiles/data2.parquet differ
diff --git a/java/thirdpartydeps/parquetfiles/data3.parquet b/java/thirdpartydeps/parquetfiles/data3.parquet
new file mode 100644
index 0000000..958edfd
Binary files /dev/null and b/java/thirdpartydeps/parquetfiles/data3.parquet differ