Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/15 13:41:07 UTC

[GitHub] [arrow] zhztheplayer opened a new pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

zhztheplayer opened a new pull request #10333:
URL: https://github.com/apache/arrow/pull/10333


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kiszk commented on a change in pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
kiszk commented on a change in pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#discussion_r632973338



##########
File path: docs/source/java/dataset.rst
##########
@@ -0,0 +1,192 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+Dataset
+===========
+
+.. warning::
+
+    Experimental: The Java module ``dataset`` is currently under early development. API might be changed in each release of Apache Arrow until it gets mature.
+
+Dataset is an universal layer in Apache Arrow for querying data in different formats or in different paritionning strategies. Usually the data to be queried is
+supposed to be located from a traditional file system, but Dataset API is not designed only for querying files but can be extended to serve all possible data sources
+such as from inter-process communication or from other network locations, etc. 
+
+Getting Started
+===========
+
+Below shows a simplest example of using Dataset to query a Parquet file in Java:
+
+.. code-block:: Java
+
+    // read data from file /opt/example.parquet
+    String uri = "file:///opt/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+    Dataset dataset = factory.finish();
+    Scanner scanner = dataset.newScan(new ScanOptions(new String[0], 100));
+    List<ArrowRecordBatch> batches = StreamSupport.stream(scanner.scan().spliterator(), false)
+        .flatMap(t -> stream(t.execute()))
+        .collect(Collectors.toList());
+    
+    // do something with read record batches, for example:
+    analyzeArrowData(batches);
+    
+    // finished the analysis of the data, close all resources:
+    AutoCloseables.close(batches);
+    AutoCloseables.close(factory, dataset, scanner);
+
+.. note::
+    ``ArrowRecordBatch`` is a low-level composite Arrow data exchange format that doesn't provide API to read typed data from it directly. It's recommended
+    to use utilities ``VectorLoader`` to load it into a schema aware container ``VectorSchemaRoot`` by which user could be able to access decoded data
+    conveniently in Java.
+
+.. seealso::
+   Load record batches with :doc:`VectorSchemaRoot <vector_schema_root>`.
+
+Schema
+===========
+
+Schema of the data to be queried can be inspected via method ``DatasetFactory#inspect()`` before actually reading it. For example:
+
+.. code-block:: Java
+
+    // read data from local file /opt/example.parquet
+    String uri = "file:///opt/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+    
+    // inspect schema
+    Schema schema = factory.inspect();
+ 
+For some of the data format that is compatible with a user-defined schema, user can use method ``DatasetFactory#inspect(Schema schema)`` to create the dataset:
+
+.. code-block:: Java
+
+    Schema schema = createUserSchema()
+    Dataset dataset = factory.finish(schema);
+
+Otherwise when the non-parameter method ``DatasetFactory#inspect()`` is called, schema will be inferred automatically from data source. The same as the result of
+``DatasetFactory#inspect()``.
+
+Also, if projector is specified during scanning (see next section :ref:`Projection`), the actual schema of output data can be got within method ``Scanner::schema()``:
+
+.. code-block:: Java
+
+    Scanner scanner = dataset.newScan(new ScanOptions(new String[] {"id", "name"}, 100));
+    Schema projectedSchema = scanner.schema();
+ 
+Projection
+===========
+
+User can specify projections in ScanOptions. For ``FileSystemDataset``, only column projection is allowed for now. Which means, only column names
+in the projection list will be accepted. For example:
+
+.. code-block:: Java
+
+    String[] projection = new String[] {"id", "name"};
+    ScanOptions options = new ScanOptions(projection, 100);
+    
+If no projection is needed, specify an empty String array ``new String[0]`` in ScanOptions:
+
+.. code-block:: Java
+
+    String[] projection = new String[0];
+    ScanOptions options = new ScanOptions(projection, 100);
+    
+This way all column will be emitted during scanning.
+
+Read data from HDFS
+===========
+
+``FileSystemDataset`` supports reading data from non-local file systems. HDFS support is included in the official Apache Arrow Java package releases and
+can be used directly without re-building the source code.
+To access HDFS data using Dataset API, pass a general HDFS URI to ``FilesSystemDatasetFactory``:
+
+.. code-block:: Java
+    
+    String uri = "hdfs://{hdfs_host}:{port}/data/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+        
+Native Memory Management
+===========
+
+To gain better performance and reducing code complexity, Java ``FileSystemDataset`` internally relys on C++ ``arrow::dataset::FileSystemDataset`` via JNI.
+As a result, All Arrow data read from ``FileSystemDataset`` is supposed to be allocated off the JVM heap. To manage this part of memory, An utility class

Review comment:
       All -> all




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
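
For reference, the note in the documentation quoted above recommends loading an ``ArrowRecordBatch`` into a ``VectorSchemaRoot`` via ``VectorLoader`` in order to read decoded values. A minimal sketch of that pattern (not part of the file under review), reusing ``allocator`` and ``batches`` from the Getting Started example and a ``schema`` such as the one returned by ``Scanner#schema()``; the classes are the standard ones from the ``org.apache.arrow.vector`` package:

    // Load each low-level record batch into a schema-aware container so the
    // decoded column values can be accessed conveniently.
    try (VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
        VectorLoader loader = new VectorLoader(root);
        for (ArrowRecordBatch batch : batches) {
            loader.load(batch);
            // root now exposes the vectors of this batch, e.g. the "id"
            // column used in the Projection example:
            FieldVector idVector = root.getVector("id");
        }
    }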



[GitHub] [arrow] kiszk commented on a change in pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
kiszk commented on a change in pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#discussion_r632973438



##########
File path: docs/source/java/dataset.rst
##########
@@ -0,0 +1,192 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+Dataset
+===========
+
+.. warning::
+
+    Experimental: The Java module ``dataset`` is currently under early development. API might be changed in each release of Apache Arrow until it gets mature.
+
+Dataset is an universal layer in Apache Arrow for querying data in different formats or in different paritionning strategies. Usually the data to be queried is
+supposed to be located from a traditional file system, but Dataset API is not designed only for querying files but can be extended to serve all possible data sources
+such as from inter-process communication or from other network locations, etc. 
+
+Getting Started
+===========
+
+Below shows a simplest example of using Dataset to query a Parquet file in Java:
+
+.. code-block:: Java
+
+    // read data from file /opt/example.parquet
+    String uri = "file:///opt/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+    Dataset dataset = factory.finish();
+    Scanner scanner = dataset.newScan(new ScanOptions(new String[0], 100));
+    List<ArrowRecordBatch> batches = StreamSupport.stream(scanner.scan().spliterator(), false)
+        .flatMap(t -> stream(t.execute()))
+        .collect(Collectors.toList());
+    
+    // do something with read record batches, for example:
+    analyzeArrowData(batches);
+    
+    // finished the analysis of the data, close all resources:
+    AutoCloseables.close(batches);
+    AutoCloseables.close(factory, dataset, scanner);
+
+.. note::
+    ``ArrowRecordBatch`` is a low-level composite Arrow data exchange format that doesn't provide API to read typed data from it directly. It's recommended
+    to use utilities ``VectorLoader`` to load it into a schema aware container ``VectorSchemaRoot`` by which user could be able to access decoded data
+    conveniently in Java.
+
+.. seealso::
+   Load record batches with :doc:`VectorSchemaRoot <vector_schema_root>`.
+
+Schema
+===========
+
+Schema of the data to be queried can be inspected via method ``DatasetFactory#inspect()`` before actually reading it. For example:
+
+.. code-block:: Java
+
+    // read data from local file /opt/example.parquet
+    String uri = "file:///opt/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+    
+    // inspect schema
+    Schema schema = factory.inspect();
+ 
+For some of the data format that is compatible with a user-defined schema, user can use method ``DatasetFactory#inspect(Schema schema)`` to create the dataset:
+
+.. code-block:: Java
+
+    Schema schema = createUserSchema()
+    Dataset dataset = factory.finish(schema);
+
+Otherwise when the non-parameter method ``DatasetFactory#inspect()`` is called, schema will be inferred automatically from data source. The same as the result of
+``DatasetFactory#inspect()``.
+
+Also, if projector is specified during scanning (see next section :ref:`Projection`), the actual schema of output data can be got within method ``Scanner::schema()``:
+
+.. code-block:: Java
+
+    Scanner scanner = dataset.newScan(new ScanOptions(new String[] {"id", "name"}, 100));
+    Schema projectedSchema = scanner.schema();
+ 
+Projection
+===========
+
+User can specify projections in ScanOptions. For ``FileSystemDataset``, only column projection is allowed for now. Which means, only column names
+in the projection list will be accepted. For example:
+
+.. code-block:: Java
+
+    String[] projection = new String[] {"id", "name"};
+    ScanOptions options = new ScanOptions(projection, 100);
+    
+If no projection is needed, specify an empty String array ``new String[0]`` in ScanOptions:
+
+.. code-block:: Java
+
+    String[] projection = new String[0];
+    ScanOptions options = new ScanOptions(projection, 100);
+    
+This way all column will be emitted during scanning.
+
+Read data from HDFS
+===========
+
+``FileSystemDataset`` supports reading data from non-local file systems. HDFS support is included in the official Apache Arrow Java package releases and
+can be used directly without re-building the source code.
+To access HDFS data using Dataset API, pass a general HDFS URI to ``FilesSystemDatasetFactory``:
+
+.. code-block:: Java
+    
+    String uri = "hdfs://{hdfs_host}:{port}/data/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+        
+Native Memory Management
+===========
+
+To gain better performance and reducing code complexity, Java ``FileSystemDataset`` internally relys on C++ ``arrow::dataset::FileSystemDataset`` via JNI.
+As a result, All Arrow data read from ``FileSystemDataset`` is supposed to be allocated off the JVM heap. To manage this part of memory, An utility class
+``NativeMemoryPool`` is provided to users.
+
+As a basic example, by using a listenable ``NativeMemoryPool``, User can pass a listener hooking on C++ buffer allocation/deallocation:

Review comment:
       User -> user




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
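
For reference, the Native Memory Management paragraph quoted above mentions passing a listener that hooks into C++ buffer allocation and deallocation when a listenable ``NativeMemoryPool`` is used. A minimal sketch of that idea (not part of the file under review), under the assumption that the dataset module exposes ``NativeMemoryPool.createListenable()`` taking a ``ReservationListener`` with reserve/unreserve callbacks; the exact names may differ in the released API:

    // Assumed API: the listener is told how many bytes the native (C++) side
    // reserves or releases, so the JVM side can account for off-heap memory.
    ReservationListener listener = new ReservationListener() {
        @Override
        public void reserve(long size) {
            // native side allocated `size` bytes off the JVM heap
        }

        @Override
        public void unreserve(long size) {
            // native side released `size` bytes
        }
    };
    NativeMemoryPool pool = NativeMemoryPool.createListenable(listener);
    DatasetFactory factory = new FileSystemDatasetFactory(allocator, pool,
        FileFormat.PARQUET, uri);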



[GitHub] [arrow] zhztheplayer commented on pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#issuecomment-841764639


   Thanks @kiszk for helping checking this :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#issuecomment-943459297


   Revision: 4d60bb0cfc6e51c13c22d90e0cee60da6fd08371
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-957](https://github.com/ursacomputing/crossbow/branches/all?query=actions-957)
   
   |Task|Status|
   |----|------|
   |test-r-devdocs|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-957-github-test-r-devdocs)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-957-github-test-r-devdocs)|
   |test-ubuntu-20.10-docs|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-957-azure-test-ubuntu-20.10-docs)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-957-azure-test-ubuntu-20.10-docs)|
   |test-ubuntu-default-docs|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-957-azure-test-ubuntu-default-docs)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-957-azure-test-ubuntu-default-docs)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#issuecomment-1016767910


   Benchmark runs are scheduled for baseline = 39adf19f31a529eaec35704685532feee1d8c7a4 and contender = 58ca356659067577e6932a636cebafb6ccc7c0df. 58ca356659067577e6932a636cebafb6ccc7c0df is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/64f3d0db494140c2a5ae6a8cca285abd...4a1d2a9e94b3405493f247dac7e8514d/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/0065dcf5208945389ffedc3bea3bfb7f...f6a760e874f84dba8985797b46682189/)
   [Finished :arrow_down:0.0% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a0344798d5ac4fb09f2c57ff8aa78888...9748f4e4a7b54de3b7494950ab56dd8c/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#issuecomment-1016767910


   Benchmark runs are scheduled for baseline = 39adf19f31a529eaec35704685532feee1d8c7a4 and contender = 58ca356659067577e6932a636cebafb6ccc7c0df. 58ca356659067577e6932a636cebafb6ccc7c0df is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/64f3d0db494140c2a5ae6a8cca285abd...4a1d2a9e94b3405493f247dac7e8514d/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/0065dcf5208945389ffedc3bea3bfb7f...f6a760e874f84dba8985797b46682189/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a0344798d5ac4fb09f2c57ff8aa78888...9748f4e4a7b54de3b7494950ab56dd8c/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kiszk commented on a change in pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
kiszk commented on a change in pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#discussion_r632973406



##########
File path: docs/source/java/dataset.rst
##########
@@ -0,0 +1,192 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+Dataset
+===========
+
+.. warning::
+
+    Experimental: The Java module ``dataset`` is currently under early development. API might be changed in each release of Apache Arrow until it gets mature.
+
+Dataset is an universal layer in Apache Arrow for querying data in different formats or in different paritionning strategies. Usually the data to be queried is
+supposed to be located from a traditional file system, but Dataset API is not designed only for querying files but can be extended to serve all possible data sources
+such as from inter-process communication or from other network locations, etc. 
+
+Getting Started
+===========
+
+Below shows a simplest example of using Dataset to query a Parquet file in Java:
+
+.. code-block:: Java
+
+    // read data from file /opt/example.parquet
+    String uri = "file:///opt/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+    Dataset dataset = factory.finish();
+    Scanner scanner = dataset.newScan(new ScanOptions(new String[0], 100));
+    List<ArrowRecordBatch> batches = StreamSupport.stream(scanner.scan().spliterator(), false)
+        .flatMap(t -> stream(t.execute()))
+        .collect(Collectors.toList());
+    
+    // do something with read record batches, for example:
+    analyzeArrowData(batches);
+    
+    // finished the analysis of the data, close all resources:
+    AutoCloseables.close(batches);
+    AutoCloseables.close(factory, dataset, scanner);
+
+.. note::
+    ``ArrowRecordBatch`` is a low-level composite Arrow data exchange format that doesn't provide API to read typed data from it directly. It's recommended
+    to use utilities ``VectorLoader`` to load it into a schema aware container ``VectorSchemaRoot`` by which user could be able to access decoded data
+    conveniently in Java.
+
+.. seealso::
+   Load record batches with :doc:`VectorSchemaRoot <vector_schema_root>`.
+
+Schema
+===========
+
+Schema of the data to be queried can be inspected via method ``DatasetFactory#inspect()`` before actually reading it. For example:
+
+.. code-block:: Java
+
+    // read data from local file /opt/example.parquet
+    String uri = "file:///opt/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+    
+    // inspect schema
+    Schema schema = factory.inspect();
+ 
+For some of the data format that is compatible with a user-defined schema, user can use method ``DatasetFactory#inspect(Schema schema)`` to create the dataset:
+
+.. code-block:: Java
+
+    Schema schema = createUserSchema()
+    Dataset dataset = factory.finish(schema);
+
+Otherwise when the non-parameter method ``DatasetFactory#inspect()`` is called, schema will be inferred automatically from data source. The same as the result of
+``DatasetFactory#inspect()``.
+
+Also, if projector is specified during scanning (see next section :ref:`Projection`), the actual schema of output data can be got within method ``Scanner::schema()``:
+
+.. code-block:: Java
+
+    Scanner scanner = dataset.newScan(new ScanOptions(new String[] {"id", "name"}, 100));
+    Schema projectedSchema = scanner.schema();
+ 
+Projection
+===========
+
+User can specify projections in ScanOptions. For ``FileSystemDataset``, only column projection is allowed for now. Which means, only column names
+in the projection list will be accepted. For example:
+
+.. code-block:: Java
+
+    String[] projection = new String[] {"id", "name"};
+    ScanOptions options = new ScanOptions(projection, 100);
+    
+If no projection is needed, specify an empty String array ``new String[0]`` in ScanOptions:
+
+.. code-block:: Java
+
+    String[] projection = new String[0];
+    ScanOptions options = new ScanOptions(projection, 100);
+    
+This way all column will be emitted during scanning.
+
+Read data from HDFS
+===========
+
+``FileSystemDataset`` supports reading data from non-local file systems. HDFS support is included in the official Apache Arrow Java package releases and
+can be used directly without re-building the source code.
+To access HDFS data using Dataset API, pass a general HDFS URI to ``FilesSystemDatasetFactory``:
+
+.. code-block:: Java
+    
+    String uri = "hdfs://{hdfs_host}:{port}/data/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+        
+Native Memory Management
+===========
+
+To gain better performance and reducing code complexity, Java ``FileSystemDataset`` internally relys on C++ ``arrow::dataset::FileSystemDataset`` via JNI.

Review comment:
       reducing -> reduce
   ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kiszk commented on a change in pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
kiszk commented on a change in pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#discussion_r633105004



##########
File path: docs/source/java/dataset.rst
##########
@@ -0,0 +1,192 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+Dataset
+===========
+
+.. warning::
+
+    Experimental: The Java module ``dataset`` is currently under early development. API might be changed in each release of Apache Arrow until it gets mature.
+
+Dataset is an universal layer in Apache Arrow for querying data in different formats or in different paritionning strategies. Usually the data to be queried is

Review comment:
       paritionning -> partitioning




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kiszk commented on a change in pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
kiszk commented on a change in pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#discussion_r633105760



##########
File path: docs/source/java/dataset.rst
##########
@@ -0,0 +1,192 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+Dataset
+===========
+
+.. warning::
+
+    Experimental: The Java module ``dataset`` is currently under early development. API might be changed in each release of Apache Arrow until it gets mature.
+
+Dataset is an universal layer in Apache Arrow for querying data in different formats or in different paritionning strategies. Usually the data to be queried is
+supposed to be located from a traditional file system, however Arrow Dataset is not designed only for querying files but can be extended to serve all possible data sources
+such as from inter-process communication or from other network locations, etc. 
+
+Getting Started
+===========
+
+Below shows a simplest example of using Dataset to query a Parquet file in Java:
+
+.. code-block:: Java
+
+    // read data from file /opt/example.parquet
+    String uri = "file:///opt/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+    Dataset dataset = factory.finish();
+    Scanner scanner = dataset.newScan(new ScanOptions(new String[0], 100));
+    List<ArrowRecordBatch> batches = StreamSupport.stream(scanner.scan().spliterator(), false)
+        .flatMap(t -> stream(t.execute()))
+        .collect(Collectors.toList());
+    
+    // do something with read record batches, for example:
+    analyzeArrowData(batches);
+    
+    // finished the analysis of the data, close all resources:
+    AutoCloseables.close(batches);
+    AutoCloseables.close(factory, dataset, scanner);
+
+.. note::
+    ``ArrowRecordBatch`` is a low-level composite Arrow data exchange format that doesn't provide API to read typed data from it directly. It's recommended
+    to use utilities ``VectorLoader`` to load it into a schema aware container ``VectorSchemaRoot`` by which user could be able to access decoded data
+    conveniently in Java.
+
+.. seealso::
+   Load record batches with :doc:`VectorSchemaRoot <vector_schema_root>`.
+
+Schema
+===========
+
+Schema of the data to be queried can be inspected via method ``DatasetFactory#inspect()`` before actually reading it. For example:
+
+.. code-block:: Java
+
+    // read data from local file /opt/example.parquet
+    String uri = "file:///opt/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+    
+    // inspect schema
+    Schema schema = factory.inspect();
+ 
+For some of the data format that is compatible with a user-defined schema, user can use method ``DatasetFactory#inspect(Schema schema)`` to create the dataset:
+
+.. code-block:: Java
+
+    Schema schema = createUserSchema()
+    Dataset dataset = factory.finish(schema);
+
+Otherwise when the non-parameter method ``DatasetFactory#inspect()`` is called, schema will be inferred automatically from data source. The same as the result of
+``DatasetFactory#inspect()``.
+
+Also, if projector is specified during scanning (see next section :ref:`Projection`), the actual schema of output data can be got within method ``Scanner::schema()``:
+
+.. code-block:: Java
+
+    Scanner scanner = dataset.newScan(new ScanOptions(new String[] {"id", "name"}, 100));
+    Schema projectedSchema = scanner.schema();
+ 
+Projection
+===========
+
+User can specify projections in ScanOptions. For ``FileSystemDataset``, only column projection is allowed for now, which means, only column names
+in the projection list will be accepted. For example:
+
+.. code-block:: Java
+
+    String[] projection = new String[] {"id", "name"};
+    ScanOptions options = new ScanOptions(projection, 100);
+    
+If no projection is needed, specify an empty String array ``new String[0]`` in ScanOptions:
+
+.. code-block:: Java
+
+    String[] projection = new String[0];
+    ScanOptions options = new ScanOptions(projection, 100);
+    
+This way all column will be emitted during scanning.

Review comment:
       nit: all column -> all columns
   ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#discussion_r729087158



##########
File path: docs/source/java/dataset.rst
##########
@@ -0,0 +1,192 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+Dataset
+===========
+
+.. warning::
+
+    Experimental: The Java module ``dataset`` is currently under early development. API might be changed in each release of Apache Arrow until it gets mature.
+
+Dataset is an universal layer in Apache Arrow for querying data in different formats or in different paritioning strategies. Usually the data to be queried is

Review comment:
       @zhztheplayer Sorry for the delay here. Can you ensure the reStructuredText source is line-wrapped around 80 characters?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#issuecomment-943458327


   @github-actions crossbow submit *docs*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#discussion_r729087861



##########
File path: docs/source/java/dataset.rst
##########
@@ -0,0 +1,192 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+Dataset
+===========
+
+.. warning::
+
+    Experimental: The Java module ``dataset`` is currently under early development. API might be changed in each release of Apache Arrow until it gets mature.
+
+Dataset is an universal layer in Apache Arrow for querying data in different formats or in different paritioning strategies. Usually the data to be queried is
+supposed to be located from a traditional file system, however Arrow Dataset is not designed only for querying files but can be extended to serve all possible data sources
+such as from inter-process communication or from other network locations, etc. 
+
+Getting Started
+===========

Review comment:
       I think rendering errors may be emitted if the dashes are not at least the same amount as the title text.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
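
To illustrate the review comment above: in reStructuredText the underline should be at least as long as the heading text, otherwise Sphinx emits a "title underline too short" warning, e.g.:

    Getting Started
    ===============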



[GitHub] [arrow] kiszk commented on a change in pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
kiszk commented on a change in pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#discussion_r632973238



##########
File path: docs/source/java/dataset.rst
##########
@@ -0,0 +1,192 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+Dataset
+===========
+
+.. warning::
+
+    Experimental: The Java module ``dataset`` is currently under early development. API might be changed in each release of Apache Arrow until it gets mature.
+
+Dataset is an universal layer in Apache Arrow for querying data in different formats or in different paritionning strategies. Usually the data to be queried is
+supposed to be located from a traditional file system, but Dataset API is not designed only for querying files but can be extended to serve all possible data sources
+such as from inter-process communication or from other network locations, etc. 
+
+Getting Started
+===========
+
+Below shows a simplest example of using Dataset to query a Parquet file in Java:
+
+.. code-block:: Java
+
+    // read data from file /opt/example.parquet
+    String uri = "file:///opt/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+    Dataset dataset = factory.finish();
+    Scanner scanner = dataset.newScan(new ScanOptions(new String[0], 100));
+    List<ArrowRecordBatch> batches = StreamSupport.stream(scanner.scan().spliterator(), false)
+        .flatMap(t -> stream(t.execute()))
+        .collect(Collectors.toList());
+    
+    // do something with read record batches, for example:
+    analyzeArrowData(batches);
+    
+    // finished the analysis of the data, close all resources:
+    AutoCloseables.close(batches);
+    AutoCloseables.close(factory, dataset, scanner);
+
+.. note::
+    ``ArrowRecordBatch`` is a low-level composite Arrow data exchange format that doesn't provide API to read typed data from it directly. It's recommended
+    to use utilities ``VectorLoader`` to load it into a schema aware container ``VectorSchemaRoot`` by which user could be able to access decoded data
+    conveniently in Java.
+
+.. seealso::
+   Load record batches with :doc:`VectorSchemaRoot <vector_schema_root>`.
+
+Schema
+===========
+
+Schema of the data to be queried can be inspected via method ``DatasetFactory#inspect()`` before actually reading it. For example:
+
+.. code-block:: Java
+
+    // read data from local file /opt/example.parquet
+    String uri = "file:///opt/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+    
+    // inspect schema
+    Schema schema = factory.inspect();
+ 
+For some of the data format that is compatible with a user-defined schema, user can use method ``DatasetFactory#inspect(Schema schema)`` to create the dataset:
+
+.. code-block:: Java
+
+    Schema schema = createUserSchema()
+    Dataset dataset = factory.finish(schema);
+
+Otherwise when the non-parameter method ``DatasetFactory#inspect()`` is called, schema will be inferred automatically from data source. The same as the result of
+``DatasetFactory#inspect()``.
+
+Also, if projector is specified during scanning (see next section :ref:`Projection`), the actual schema of output data can be got within method ``Scanner::schema()``:
+
+.. code-block:: Java
+
+    Scanner scanner = dataset.newScan(new ScanOptions(new String[] {"id", "name"}, 100));
+    Schema projectedSchema = scanner.schema();
+ 
+Projection
+===========
+
+User can specify projections in ScanOptions. For ``FileSystemDataset``, only column projection is allowed for now. Which means, only column names

Review comment:
       nit: now. -> now,  ??




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#issuecomment-1016767910


   Benchmark runs are scheduled for baseline = 39adf19f31a529eaec35704685532feee1d8c7a4 and contender = 58ca356659067577e6932a636cebafb6ccc7c0df. 58ca356659067577e6932a636cebafb6ccc7c0df is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/64f3d0db494140c2a5ae6a8cca285abd...4a1d2a9e94b3405493f247dac7e8514d/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/0065dcf5208945389ffedc3bea3bfb7f...f6a760e874f84dba8985797b46682189/)
   [Finished :arrow_down:0.0% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a0344798d5ac4fb09f2c57ff8aa78888...9748f4e4a7b54de3b7494950ab56dd8c/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kszucs closed pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
kszucs closed pull request #10333:
URL: https://github.com/apache/arrow/pull/10333


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kiszk commented on a change in pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
kiszk commented on a change in pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#discussion_r633105004



##########
File path: docs/source/java/dataset.rst
##########
@@ -0,0 +1,192 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+Dataset
+===========
+
+.. warning::
+
+    Experimental: The Java module ``dataset`` is currently under early development. API might be changed in each release of Apache Arrow until it gets mature.
+
+Dataset is an universal layer in Apache Arrow for querying data in different formats or in different paritionning strategies. Usually the data to be queried is

Review comment:
       nit: paritionning -> partitioning




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#issuecomment-841660817


   https://issues.apache.org/jira/browse/ARROW-12607


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kiszk commented on a change in pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
kiszk commented on a change in pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#discussion_r633105004



##########
File path: docs/source/java/dataset.rst
##########
@@ -0,0 +1,192 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+Dataset
+===========
+
+.. warning::
+
+    Experimental: The Java module ``dataset`` is currently under early development. API might be changed in each release of Apache Arrow until it gets mature.
+
+Dataset is an universal layer in Apache Arrow for querying data in different formats or in different paritionning strategies. Usually the data to be queried is

Review comment:
       nit: paritionning -> partitioning

##########
File path: docs/source/java/dataset.rst
##########
@@ -0,0 +1,192 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+Dataset
+===========
+
+.. warning::
+
+    Experimental: The Java module ``dataset`` is currently under early development. API might be changed in each release of Apache Arrow until it gets mature.
+
+Dataset is an universal layer in Apache Arrow for querying data in different formats or in different paritionning strategies. Usually the data to be queried is

Review comment:
       paritionning -> partitioning

##########
File path: docs/source/java/dataset.rst
##########
@@ -0,0 +1,192 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+Dataset
+===========
+
+.. warning::
+
+    Experimental: The Java module ``dataset`` is currently under early development. API might be changed in each release of Apache Arrow until it gets mature.
+
+Dataset is an universal layer in Apache Arrow for querying data in different formats or in different paritionning strategies. Usually the data to be queried is
+supposed to be located from a traditional file system, however Arrow Dataset is not designed only for querying files but can be extended to serve all possible data sources
+such as from inter-process communication or from other network locations, etc. 
+
+Getting Started
+===========
+
+Below shows a simplest example of using Dataset to query a Parquet file in Java:
+
+.. code-block:: Java
+
+    // read data from file /opt/example.parquet
+    String uri = "file:///opt/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+    Dataset dataset = factory.finish();
+    Scanner scanner = dataset.newScan(new ScanOptions(new String[0], 100));
+    List<ArrowRecordBatch> batches = StreamSupport.stream(scanner.scan().spliterator(), false)
+        .flatMap(t -> stream(t.execute()))
+        .collect(Collectors.toList());
+    
+    // do something with read record batches, for example:
+    analyzeArrowData(batches);
+    
+    // finished the analysis of the data, close all resources:
+    AutoCloseables.close(batches);
+    AutoCloseables.close(factory, dataset, scanner);
+
+.. note::
+    ``ArrowRecordBatch`` is a low-level composite Arrow data exchange format that doesn't provide API to read typed data from it directly. It's recommended
+    to use utilities ``VectorLoader`` to load it into a schema aware container ``VectorSchemaRoot`` by which user could be able to access decoded data
+    conveniently in Java.
+
+.. seealso::
+   Load record batches with :doc:`VectorSchemaRoot <vector_schema_root>`.
+
+Schema
+===========
+
+Schema of the data to be queried can be inspected via method ``DatasetFactory#inspect()`` before actually reading it. For example:
+
+.. code-block:: Java
+
+    // read data from local file /opt/example.parquet
+    String uri = "file:///opt/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+    
+    // inspect schema
+    Schema schema = factory.inspect();
+ 
+For some of the data format that is compatible with a user-defined schema, user can use method ``DatasetFactory#inspect(Schema schema)`` to create the dataset:
+
+.. code-block:: Java
+
+    Schema schema = createUserSchema()
+    Dataset dataset = factory.finish(schema);
+
+Otherwise when the non-parameter method ``DatasetFactory#inspect()`` is called, schema will be inferred automatically from data source. The same as the result of
+``DatasetFactory#inspect()``.
+
+Also, if projector is specified during scanning (see next section :ref:`Projection`), the actual schema of output data can be got within method ``Scanner::schema()``:
+
+.. code-block:: Java
+
+    Scanner scanner = dataset.newScan(new ScanOptions(new String[] {"id", "name"}, 100));
+    Schema projectedSchema = scanner.schema();
+ 
+Projection
+===========
+
+User can specify projections in ScanOptions. For ``FileSystemDataset``, only column projection is allowed for now, which means, only column names
+in the projection list will be accepted. For example:
+
+.. code-block:: Java
+
+    String[] projection = new String[] {"id", "name"};
+    ScanOptions options = new ScanOptions(projection, 100);
+    
+If no projection is needed, specify an empty String array ``new String[0]`` in ScanOptions:
+
+.. code-block:: Java
+
+    String[] projection = new String[0];
+    ScanOptions options = new ScanOptions(projection, 100);
+    
+This way all column will be emitted during scanning.

Review comment:
       nit: all column -> all columns
   ?







[GitHub] [arrow] ursabot edited a comment on pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#issuecomment-1016767910


   Benchmark runs are scheduled for baseline = 39adf19f31a529eaec35704685532feee1d8c7a4 and contender = 58ca356659067577e6932a636cebafb6ccc7c0df. 58ca356659067577e6932a636cebafb6ccc7c0df is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/64f3d0db494140c2a5ae6a8cca285abd...4a1d2a9e94b3405493f247dac7e8514d/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/0065dcf5208945389ffedc3bea3bfb7f...f6a760e874f84dba8985797b46682189/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a0344798d5ac4fb09f2c57ff8aa78888...9748f4e4a7b54de3b7494950ab56dd8c/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] zhztheplayer commented on a change in pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
zhztheplayer commented on a change in pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#discussion_r729139175



##########
File path: docs/source/java/dataset.rst
##########
@@ -0,0 +1,192 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+Dataset
+===========
+
+.. warning::
+
+    Experimental: The Java module ``dataset`` is currently under early development. API might be changed in each release of Apache Arrow until it gets mature.
+
+Dataset is an universal layer in Apache Arrow for querying data in different formats or in different paritioning strategies. Usually the data to be queried is

Review comment:
       Done.







[GitHub] [arrow] pitrou commented on pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#issuecomment-943457635


   @emkornfield I don't know if you have time to give this a quick look?





[GitHub] [arrow] kiszk commented on a change in pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

Posted by GitBox <gi...@apache.org>.
kiszk commented on a change in pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#discussion_r632973238



##########
File path: docs/source/java/dataset.rst
##########
@@ -0,0 +1,192 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+Dataset
+===========
+
+.. warning::
+
+    Experimental: The Java module ``dataset`` is currently under early development. API might be changed in each release of Apache Arrow until it gets mature.
+
+Dataset is an universal layer in Apache Arrow for querying data in different formats or in different paritionning strategies. Usually the data to be queried is
+supposed to be located from a traditional file system, but Dataset API is not designed only for querying files but can be extended to serve all possible data sources
+such as from inter-process communication or from other network locations, etc. 
+
+Getting Started
+===========
+
+Below shows a simplest example of using Dataset to query a Parquet file in Java:
+
+.. code-block:: Java
+
+    // read data from file /opt/example.parquet
+    String uri = "file:///opt/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+    Dataset dataset = factory.finish();
+    Scanner scanner = dataset.newScan(new ScanOptions(new String[0], 100));
+    List<ArrowRecordBatch> batches = StreamSupport.stream(scanner.scan().spliterator(), false)
+        .flatMap(t -> stream(t.execute()))
+        .collect(Collectors.toList());
+    
+    // do something with read record batches, for example:
+    analyzeArrowData(batches);
+    
+    // finished the analysis of the data, close all resources:
+    AutoCloseables.close(batches);
+    AutoCloseables.close(factory, dataset, scanner);
+
+.. note::
+    ``ArrowRecordBatch`` is a low-level composite Arrow data exchange format that doesn't provide API to read typed data from it directly. It's recommended
+    to use utilities ``VectorLoader`` to load it into a schema aware container ``VectorSchemaRoot`` by which user could be able to access decoded data
+    conveniently in Java.
+
+.. seealso::
+   Load record batches with :doc:`VectorSchemaRoot <vector_schema_root>`.
+
+Schema
+===========
+
+Schema of the data to be queried can be inspected via method ``DatasetFactory#inspect()`` before actually reading it. For example:
+
+.. code-block:: Java
+
+    // read data from local file /opt/example.parquet
+    String uri = "file:///opt/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+    
+    // inspect schema
+    Schema schema = factory.inspect();
+ 
+For some of the data format that is compatible with a user-defined schema, user can use method ``DatasetFactory#inspect(Schema schema)`` to create the dataset:
+
+.. code-block:: Java
+
+    Schema schema = createUserSchema()
+    Dataset dataset = factory.finish(schema);
+
+Otherwise when the non-parameter method ``DatasetFactory#inspect()`` is called, schema will be inferred automatically from data source. The same as the result of
+``DatasetFactory#inspect()``.
+
+Also, if projector is specified during scanning (see next section :ref:`Projection`), the actual schema of output data can be got within method ``Scanner::schema()``:
+
+.. code-block:: Java
+
+    Scanner scanner = dataset.newScan(new ScanOptions(new String[] {"id", "name"}, 100));
+    Schema projectedSchema = scanner.schema();
+ 
+Projection
+===========
+
+User can specify projections in ScanOptions. For ``FileSystemDataset``, only column projection is allowed for now. Which means, only column names

Review comment:
       nit: now. -> now,
   ?



