Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/15 15:29:40 UTC

[GitHub] [arrow] kiszk commented on a change in pull request #10333: ARROW-12607: [Website] Doc section for Dataset Java bindings

kiszk commented on a change in pull request #10333:
URL: https://github.com/apache/arrow/pull/10333#discussion_r632973238



##########
File path: docs/source/java/dataset.rst
##########
@@ -0,0 +1,192 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+Dataset
+===========
+
+.. warning::
+
+    Experimental: The Java module ``dataset`` is currently under early development. APIs might change between releases of Apache Arrow until the module matures.
+
+Dataset is a universal layer in Apache Arrow for querying data in different formats or with different partitioning strategies. Usually the data to be queried
+resides on a traditional file system, but the Dataset API is not designed only for querying files; it can be extended to serve all possible data sources,
+such as inter-process communication or other network locations.
+
+Getting Started
+===============
+
+Below is a simple example of using Dataset to query a Parquet file in Java:
+
+.. code-block:: java
+
+    // read data from file /opt/example.parquet
+    String uri = "file:///opt/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+    Dataset dataset = factory.finish();
+    Scanner scanner = dataset.newScan(new ScanOptions(new String[0], 100));
+    List<ArrowRecordBatch> batches = StreamSupport.stream(scanner.scan().spliterator(), false)
+        .flatMap(t -> stream(t.execute()))
+        .collect(Collectors.toList());
+    
+    // do something with the loaded record batches, for example:
+    analyzeArrowData(batches);
+    
+    // once the analysis of the data is finished, close all resources:
+    AutoCloseables.close(batches);
+    AutoCloseables.close(scanner, dataset, factory);
+
+.. note::
+    ``ArrowRecordBatch`` is a low-level composite Arrow data exchange format that doesn't provide an API for reading typed data from it directly. It's
+    recommended to use the utility ``VectorLoader`` to load it into the schema-aware container ``VectorSchemaRoot``, through which users can conveniently
+    access decoded data in Java.
+
+.. seealso::
+   Load record batches with :doc:`VectorSchemaRoot <vector_schema_root>`.
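The loading step recommended in the note above can be sketched as follows. This is a minimal, hedged sketch, not part of the original example: it assumes the Arrow Java artifacts are on the classpath, and that ``schema`` and ``batch`` come from a scan as in the Getting Started example.

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.arrow.vector.types.pojo.Schema;

public class LoadBatchSketch {
    // `schema` is the scanner's output schema; `batch` is one batch read from the scan.
    static void readBatch(Schema schema, ArrowRecordBatch batch, BufferAllocator allocator) {
        // Create a schema-aware container and populate its vectors from the low-level batch.
        try (VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
            VectorLoader loader = new VectorLoader(root);
            loader.load(batch);
            // Typed access is now possible, e.g. root.getVector("id"),
            // or a quick textual dump of the decoded content:
            System.out.println(root.contentToTSVString());
        }
    }
}
```

Closing the ``VectorSchemaRoot`` (here via try-with-resources) releases the buffers it holds; the caller remains responsible for closing the original ``ArrowRecordBatch``.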
+
+Schema
+===========
+
+The schema of the data to be queried can be inspected via the method ``DatasetFactory#inspect()`` before actually reading it. For example:
+
+.. code-block:: java
+
+    // read data from local file /opt/example.parquet
+    String uri = "file:///opt/example.parquet";
+    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    DatasetFactory factory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(),
+        FileFormat.PARQUET, uri);
+    
+    // inspect schema
+    Schema schema = factory.inspect();
+ 
+For data formats that are compatible with a user-defined schema, the user can create the dataset with that schema via the method ``DatasetFactory#finish(Schema schema)``:
+
+.. code-block:: java
+
+    Schema schema = createUserSchema();
+    Dataset dataset = factory.finish(schema);
+
+Otherwise, if the no-argument method ``DatasetFactory#finish()`` is called, the schema will be inferred automatically from the data source, the same as the
+result of ``DatasetFactory#inspect()``.
+
+Also, if a projector is specified during scanning (see the next section :ref:`Projection`), the actual schema of the output data can be obtained via the method ``Scanner::schema()``:
+
+.. code-block:: java
+
+    Scanner scanner = dataset.newScan(new ScanOptions(new String[] {"id", "name"}, 100));
+    Schema projectedSchema = scanner.schema();
+ 
+Projection
+===========
+
+Users can specify projections in ``ScanOptions``. For ``FileSystemDataset``, only column projection is allowed for now, which means only column names

Review comment:
       nit: now. -> now,  ??




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org