Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/17 13:37:06 UTC

[GitHub] [arrow] lidavidm commented on a change in pull request #12634: ARROW-15576: [Java][Doc] Apache Arrow VectorSchemaRoots for 2D data

lidavidm commented on a change in pull request #12634:
URL: https://github.com/apache/arrow/pull/12634#discussion_r829123862



##########
File path: docs/source/java/vector_schema_root.rst
##########
@@ -15,21 +15,79 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
-================
+.. default-domain:: java
+.. highlight:: java
+
+============
+Tabular Data
+============
+
+While arrays (aka: :doc:`ValueVector <./vector>`) represent a one-dimensional sequence of
+homogeneous values, data often comes in the form of two-dimensional sets of
+heterogeneous data (such as database tables, CSV files...). Arrow provides
+several abstractions to handle such data conveniently and efficiently.
+
+Fields
+======
+
+Fields are used to denote the particular columns of a table.
+A field, i.e. an instance of `Field`_, holds together a field name, a data
+type, and some optional key-value metadata.
+
+.. code-block:: Java
+
+    // Create a column "document" of string type with metadata
+    import org.apache.arrow.vector.types.pojo.ArrowType;
+    import org.apache.arrow.vector.types.pojo.Field;
+    import org.apache.arrow.vector.types.pojo.FieldType;
+
+    Map<String, String> metadata = new HashMap<>();
+    metadata.put("A", "Id card");
+    metadata.put("B", "Passport");
+    metadata.put("C", "Visa");
+    Field document = new Field("document", new FieldType(true, new ArrowType.Utf8(), /*dictionary*/ null, metadata), /*children*/ null);
+
+Schemas
+=======
+
+A `Schema`_ describes the overall structure consisting of any number of columns. It holds a sequence of fields together
+with some optional schema-wide metadata (in addition to per-field metadata).
+
+.. code-block:: Java
+
+    // Create a schema describing datasets with two columns:
+    // a int32 column "A" and a utf8-encoded string column "B"
+    import org.apache.arrow.vector.types.pojo.ArrowType;
+    import org.apache.arrow.vector.types.pojo.Field;
+    import org.apache.arrow.vector.types.pojo.FieldType;
+    import org.apache.arrow.vector.types.pojo.Schema;
+    import static java.util.Arrays.asList;
+
+    Map<String, String> metadata = new HashMap<>();
+    metadata.put("K1", "V1");
+    metadata.put("K2", "V2");
+    Field a = new Field("A", FieldType.nullable(new ArrowType.Int(32, true)), null);
+    Field b = new Field("B", FieldType.nullable(new ArrowType.Utf8()), null);
+    Schema schema = new Schema(asList(a, b), metadata);
+
 VectorSchemaRoot
 ================
+
+.. note::
+
+    VectorSchemaRoot is somewhat analogous to tables and record batches in the other Arrow implementations
+    in that they all are 2D datasets, but the usage is different.
+
 A :class:`VectorSchemaRoot` is a container that can hold batches, batches flow through :class:`VectorSchemaRoot`

Review comment:
       I'm not so sure about this, since we haven't introduced batches yet at this point, but we can revisit this later.

##########
File path: docs/source/java/vector_schema_root.rst
##########
@@ -15,21 +15,79 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
-================
+.. default-domain:: java
+.. highlight:: java
+
+============
+Tabular Data
+============
+
+While arrays (aka: :doc:`ValueVector <./vector>`) represent a one-dimensional sequence of
+homogeneous values, data often comes in the form of two-dimensional sets of
+heterogeneous data (such as database tables, CSV files...). Arrow provides
+several abstractions to handle such data conveniently and efficiently.
+
+Fields
+======
+
+Fields are used to denote the particular columns of a table.

Review comment:
       Wait, sorry I missed this. We should not talk about tables in Java. Since we haven't introduced VectorSchemaRoot yet, we can talk about "tabular data" abstractly.



