Posted to commits@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/01 12:26:41 UTC

[GitHub] [arrow-site] alamb opened a new pull request, #246: ARROW-17909: [WEBSITE] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

alamb opened a new pull request, #246:
URL: https://github.com/apache/arrow-site/pull/246

   Part 1: https://github.com/apache/arrow-site/pull/245
   
   See rationale on https://issues.apache.org/jira/browse/ARROW-17907
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] alamb commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r985217862


##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,345 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second of a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column: a column that contains one or more other columns.
+
+For example, consider the following three JSON documents:
+
+```json
+{              <-- First record
+  “a”: 1,      <-- the top level fields are a, b, c, and d
+  “b”: {
+    “b1”: 1,   <-- b1 and b2 are “nested” fields of “b”
+    “b2”: 3    <-- b2 is always provided (not null)
+   },
+ “d”: {
+   “d1”:  1    <-- d1 is a “nested” field of “d”
+  }
+}
+```
+```json
+{              <-- Second record
+  “a”: 2,
+  “b”: {
+    “b2”: 4    <-- note “b1” is NULL in this record
+  },
+  “c”: {       <-- note “c” was NULL in the first record
+    “c1”: 6        but when “c” is provided, c1 is also always provided
+  },
+  “d”: {
+    “d1”: 2,
+    “d2”: 1
+  }
+}
+```
+```json
+{              <-- Third record
+  “b”: {
+    “b1”: 5,
+    “b2”: 6
+  },
+  “c”: {
+    “c1”: 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: “a”, nullable: true, datatype: Int32)
+Field(name: “b”, nullable: false, datatype: Struct[
+  Field(name: “b1”, nullable: true, datatype: Int32),
+  Field(name: “b2”, nullable: false, datatype: Int32)
+])
+Field(name: “c”), nullable: true, datatype: Struct[
+  Field(name: “c1”, nullable: false, datatype: Int32)
+])
+Field(name: “d”), nullable: true, datatype: Struct[
+  Field(name: “d1”, nullable: false, datatype: Int32)
+  Field(name: “d2”, nullable: true, datatype: Int32)
+])
+```
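+
+As a rough sketch, this schema could be declared programmatically with the Rust [arrow](https://github.com/apache/arrow-rs) crate along the following lines (the crate must be added as a dependency, and the exact argument type of `DataType::Struct` has changed between arrow releases, so treat the details as illustrative):
+
+```rust
+use arrow::datatypes::{DataType, Field, Schema};
+
+fn main() {
+    // Nullability mirrors the schema above: nullable fields may be NULL,
+    // non-nullable fields must be present whenever their parent is.
+    let b = DataType::Struct(vec![
+        Field::new("b1", DataType::Int32, true),
+        Field::new("b2", DataType::Int32, false),
+    ].into());
+    let c = DataType::Struct(vec![Field::new("c1", DataType::Int32, false)].into());
+    let d = DataType::Struct(vec![
+        Field::new("d1", DataType::Int32, false),
+        Field::new("d2", DataType::Int32, true),
+    ].into());
+
+    let schema = Schema::new(vec![
+        Field::new("a", DataType::Int32, true),
+        Field::new("b", b, false),
+        Field::new("c", c, true),
+        Field::new("d", d, true),
+    ]);
+    println!("{:#?}", schema);
+}
+```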
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  5  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐      ┌──────────────────────┐┌──────────────────┐ │
+│ │ ┌─────┐    ┌─────┐   │  │ │ │  ┌─────┐   ┌─────┐   ││ ┌─────┐   ┌─────┐│
+  │ │  0  │    │ ??  │   │      │  │  1  │   │  1  │   ││ │  0  │   │ ??  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  6  │   │      │  │  1  │   │  2  │   ││ │  1  │   │  1  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  7  │   │      │  │  0  │   │ ??  │   ││ │ ??  │   │ ??  ││ │
+│ │ └─────┘    └─────┘   │  │ │ │  └─────┘   └─────┘   ││ └─────┘   └─────┘│
+  │ Validity    Values   │      │  Validity   Values   ││ Validity   Values│ │
+│ │                      │  │ │ │                      ││                  │
+  │            "c.c1"    │      │            "d.d1"    ││ "d.d2"           │ │
+│ │            Primitive │  │ │ │            Primitive ││ PrimitiveArray   │
+  │            Array     │      │            Array     ││                  │ │
+│ └──────────────────────┘  │ │ └──────────────────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
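+
+For illustration, the `StructArray` for column `b` above could be assembled roughly as follows. Only the nullable child `b1` carries a validity mask; `b2` stores bare values. Note that the `StructArray::from` constructor shown here matches older arrow releases that accept `(Field, ArrayRef)` pairs, while newer releases expect `(Arc<Field>, ArrayRef)`:
+
+```rust
+use std::sync::Arc;
+
+use arrow::array::{Array, ArrayRef, Int32Array, StructArray};
+use arrow::datatypes::{DataType, Field};
+
+fn main() {
+    // b.b1 and b.b2 across the three example records;
+    // b1 is null in the second record, b2 is always present
+    let b1: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(5)]));
+    let b2: ArrayRef = Arc::new(Int32Array::from(vec![3, 4, 6]));
+
+    let b = StructArray::from(vec![
+        (Field::new("b1", DataType::Int32, true), b1),
+        (Field::new("b2", DataType::Int32, false), b2),
+    ]);
+
+    // b1 has one null (validity 1, 0, 1); b2 has none
+    assert_eq!(b.column(0).null_count(), 1);
+    assert_eq!(b.column(1).null_count(), 0);
+}
+```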
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion; instead it only stores definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of d.d2, which contains two nullable levels, d and d2.
+
+A definition level of 0 would imply a null at the level of d:
+
+```json
+{
+}
+```
+
+A definition level of 1 would imply a null at the level of d.d2
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of 2 would imply a defined value for d.d2:
+
+```json
+{
+  d: { d2: .. }
+}
+```
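+
+The mapping from nested nullability to definition levels can be sketched in a few lines of plain Rust. The `def_levels` helper below is purely illustrative; the outer `Option` models whether `d` is present and the inner `Option` whether `d.d2` is present:
+
+```rust
+// Derive the definition level of "d.d2" for each record
+fn def_levels(d_d2: &[Option<Option<i32>>]) -> Vec<i16> {
+    d_d2.iter()
+        .map(|v| match v {
+            None => 0,          // "d" itself is null
+            Some(None) => 1,    // "d" is present but "d2" is null
+            Some(Some(_)) => 2, // both "d" and "d2" are defined
+        })
+        .collect()
+}
+
+fn main() {
+    // The three example records: d.d2 absent, d.d2 = 1, "d" itself absent
+    let d_d2 = vec![Some(None), Some(Some(1)), None];
+    assert_eq!(def_levels(&d_d2), vec![1, 2, 0]);
+}
+```
+
+When converting Arrow data to Parquet, essentially this computation, generalized to arbitrary levels of nesting, turns the validity masks into the levels stored in the file.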
+
+Thus the parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  1  │    │  2  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  0  │             │ │  │  5  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐ │   ┌──────────────────────┐ ┌──────────────────────┐  │
+│ │  ┌─────┐    ┌─────┐  │   │ │  ┌─────┐    ┌─────┐  │ │ ┌─────┐     ┌─────┐  │
+  │  │  0  │    │  6  │  │ │   │  │  1  │    │  1  │  │ │ │  1  │     │  1  │  │  │
+│ │  ├─────┤    ├─────┤  │   │ │  ├─────┤    ├─────┤  │ │ ├─────┤     └─────┘  │
+  │  │  1  │    │  7  │  │ │   │  │  1  │    │  2  │  │ │ │  2  │              │  │
+│ │  ├─────┤    └─────┘  │   │ │  ├─────┤    └─────┘  │ │ ├─────┤              │
+  │  │  1  │             │ │   │  │  0  │             │ │ │  0  │              │  │
+│ │  └─────┘             │   │ │  └─────┘             │ │ └─────┘              │
+  │                      │ │   │                      │ │                      │  │
+│ │  Definition   Data   │   │ │  Definition   Data   │ │ Definition   Data    │
+  │    Levels            │ │   │    Levels            │ │   Levels             │  │
+│ │                      │   │ │                      │ │                      │
+  │  "c.1"               │ │   │  "d.1"               │ │  "d.d2"              │  │
+│ └──────────────────────┘   │ └──────────────────────┘ └──────────────────────┘
+     "c"                   │      "d"                                             │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+```
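+
+Putting the pieces together, a rough sketch of writing columns `a` and `b` from the running example to a Parquet file with the Rust arrow and parquet crates might look as follows (the output path is arbitrary, `RecordBatch::try_from_iter` infers a schema from the arrays, and as above the `StructArray::from` signature varies between releases):
+
+```rust
+use std::fs::File;
+use std::sync::Arc;
+
+use arrow::array::{ArrayRef, Int32Array, StructArray};
+use arrow::datatypes::{DataType, Field};
+use arrow::record_batch::RecordBatch;
+use parquet::arrow::ArrowWriter;
+
+fn main() -> Result<(), Box<dyn std::error::Error>> {
+    // Column "a" (null in the third record) and struct column "b"
+    let a: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), Some(2), None]));
+    let b1: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(5)]));
+    let b2: ArrayRef = Arc::new(Int32Array::from(vec![3, 4, 6]));
+    let b: ArrayRef = Arc::new(StructArray::from(vec![
+        (Field::new("b1", DataType::Int32, true), b1),
+        (Field::new("b2", DataType::Int32, false), b2),
+    ]));
+
+    // Write a single row group; the writer derives the definition levels
+    // shown above from each array's validity information
+    let batch = RecordBatch::try_from_iter(vec![("a", a), ("b", b)])?;
+    let file = File::create("/tmp/nested_example.parquet")?;
+    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
+    writer.write(&batch)?;
+    writer.close()?;
+    Ok(())
+}
+```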
+
+## List / Repeated Columns

Review Comment:
   Added in 672d09b611



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] wjones127 commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r985263582


##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,341 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second of a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column: a column that contains one or more other columns.
+
+For example, consider the following three JSON documents:
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also always provided
+  },
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                  ┌──────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐   │┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │   ││  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │   ││  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │   ││ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘   │└─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity  │ Values   ││ Validity   Values│ │
+│            │           │  │ │             │          ││                  │
+             │ "c.c1"    │                  │"d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │             │Primitive ││ PrimitiveArray   │
+             │ Array     │                  │Array     ││                  │ │
+│            └───────────┘  │ │             └──────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion; instead it only stores definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of d.d2, which contains two nullable levels, d and d2.
+
+A definition level of 0 would imply a null at the level of d:
+
+```json
+{
+}
+```
+
+A definition level of 1 would imply a null at the level of d.d2
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of 2 would imply a defined value for d.d2:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Goin back to the JSON documents above, this format could be stored in this parquet schema

Review Comment:
   ```suggestion
   Going back to the JSON documents above, this format could be stored in this parquet schema
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] tustvold commented on a diff in pull request #246: ARROW-17909: [WEBSITE] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r985101654


##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,345 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second of a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column: a column that contains one or more other columns.
+
+For example, consider the following three JSON documents:
+
+```json
+{              <-- First record
+  “a”: 1,      <-- the top level fields are a, b, c, and d
+  “b”: {
+    “b1”: 1,   <-- b1 and b2 are “nested” fields of “b”
+    “b2”: 3    <-- b2 is always provided (not null)
+   },
+ “d”: {
+   “d1”:  1    <-- d1 is a “nested” field of “d”
+  }
+}
+```
+```json
+{              <-- Second record
+  “a”: 2,
+  “b”: {
+    “b2”: 4    <-- note “b1” is NULL in this record
+  },
+  “c”: {       <-- note “c” was NULL in the first record
+    “c1”: 6        but when “c” is provided, c1 is also always provided
+  },
+  “d”: {
+    “d1”: 2,
+    “d2”: 1
+  }
+}
+```
+```json
+{              <-- Third record
+  “b”: {
+    “b1”: 5,
+    “b2”: 6
+  },
+  “c”: {
+    “c1”: 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: “a”, nullable: true, datatype: Int32)
+Field(name: “b”, nullable: false, datatype: Struct[
+  Field(name: “b1”, nullable: true, datatype: Int32),
+  Field(name: “b2”, nullable: false, datatype: Int32)
+])
+Field(name: “c”), nullable: true, datatype: Struct[
+  Field(name: “c1”, nullable: false, datatype: Int32)
+])
+Field(name: “d”), nullable: true, datatype: Struct[
+  Field(name: “d1”, nullable: false, datatype: Int32)
+  Field(name: “d2”, nullable: true, datatype: Int32)
+])
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  5  │   │

Review Comment:
   ```suggestion
     │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
   ```



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,345 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second of a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column: a column that contains one or more other columns.
+
+For example, consider the following three JSON documents:
+
+```json
+{              <-- First record
+  “a”: 1,      <-- the top level fields are a, b, c, and d
+  “b”: {
+    “b1”: 1,   <-- b1 and b2 are “nested” fields of “b”
+    “b2”: 3    <-- b2 is always provided (not null)
+   },
+ “d”: {
+   “d1”:  1    <-- d1 is a “nested” field of “d”
+  }
+}
+```
+```json
+{              <-- Second record
+  “a”: 2,
+  “b”: {
+    “b2”: 4    <-- note “b1” is NULL in this record
+  },
+  “c”: {       <-- note “c” was NULL in the first record
+    “c1”: 6        but when “c” is provided, c1 is also always provided
+  },
+  “d”: {
+    “d1”: 2,
+    “d2”: 1
+  }
+}
+```
+```json
+{              <-- Third record
+  “b”: {
+    “b1”: 5,
+    “b2”: 6
+  },
+  “c”: {
+    “c1”: 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: “a”, nullable: true, datatype: Int32)
+Field(name: “b”, nullable: false, datatype: Struct[
+  Field(name: “b1”, nullable: true, datatype: Int32),
+  Field(name: “b2”, nullable: false, datatype: Int32)
+])
+Field(name: “c”), nullable: true, datatype: Struct[
+  Field(name: “c1”, nullable: false, datatype: Int32)
+])
+Field(name: “d”), nullable: true, datatype: Struct[
+  Field(name: “d1”, nullable: false, datatype: Int32)
+  Field(name: “d2”, nullable: true, datatype: Int32)
+])
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  5  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐      ┌──────────────────────┐┌──────────────────┐ │
+│ │ ┌─────┐    ┌─────┐   │  │ │ │  ┌─────┐   ┌─────┐   ││ ┌─────┐   ┌─────┐│
+  │ │  0  │    │ ??  │   │      │  │  1  │   │  1  │   ││ │  0  │   │ ??  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  6  │   │      │  │  1  │   │  2  │   ││ │  1  │   │  1  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  7  │   │      │  │  0  │   │ ??  │   ││ │ ??  │   │ ??  ││ │
+│ │ └─────┘    └─────┘   │  │ │ │  └─────┘   └─────┘   ││ └─────┘   └─────┘│
+  │ Validity    Values   │      │  Validity   Values   ││ Validity   Values│ │
+│ │                      │  │ │ │                      ││                  │
+  │            "c.c1"    │      │            "d.d1"    ││ "d.d2"           │ │
+│ │            Primitive │  │ │ │            Primitive ││ PrimitiveArray   │
+  │            Array     │      │            Array     ││                  │ │
+│ └──────────────────────┘  │ │ └──────────────────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray

Review Comment:
   I think this could make it clearer the distinction between masks associated with the StructArray and those associated with the PrimitiveArray
   
   In particular, "c" should have validity and not "c.c1". Similarly "d" should have a validity and not "d.d1"



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,345 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second of a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column: a column that contains one or more other columns.
+
+For example, consider the following three JSON documents:
+
+```json
+{              <-- First record
+  “a”: 1,      <-- the top level fields are a, b, c, and d
+  “b”: {
+    “b1”: 1,   <-- b1 and b2 are “nested” fields of “b”
+    “b2”: 3    <-- b2 is always provided (not null)
+   },
+ “d”: {
+   “d1”:  1    <-- d1 is a “nested” field of “d”
+  }
+}
+```
+```json
+{              <-- Second record
+  “a”: 2,
+  “b”: {
+    “b2”: 4    <-- note “b1” is NULL in this record
+  },
+  “c”: {       <-- note “c” was NULL in the first record
+    “c1”: 6        but when “c” is provided, c1 is also always provided
+  },
+  “d”: {
+    “d1”: 2,
+    “d2”: 1
+  }
+}
+```
+```json
+{              <-- Third record
+  “b”: {
+    “b1”: 5,
+    “b2”: 6
+  },
+  “c”: {
+    “c1”: 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: “a”, nullable: true, datatype: Int32)
+Field(name: “b”, nullable: false, datatype: Struct[
+  Field(name: “b1”, nullable: true, datatype: Int32),
+  Field(name: “b2”, nullable: false, datatype: Int32)
+])
+Field(name: “c”), nullable: true, datatype: Struct[
+  Field(name: “c1”, nullable: false, datatype: Int32)
+])
+Field(name: “d”), nullable: true, datatype: Struct[
+  Field(name: “d1”, nullable: false, datatype: Int32)
+  Field(name: “d2”, nullable: true, datatype: Int32)
+])
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```

Review Comment:
   Perhaps we should move this down to where we discuss the parquet encoding?



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,345 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second of a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column: a column that contains one or more other columns.
+
+For example, consider the following three JSON documents:
+
+```json
+{              <-- First record
+  “a”: 1,      <-- the top level fields are a, b, c, and d
+  “b”: {
+    “b1”: 1,   <-- b1 and b2 are “nested” fields of “b”
+    “b2”: 3    <-- b2 is always provided (not null)
+   },
+ “d”: {
+   “d1”:  1    <-- d1 is a “nested” field of “d”
+  }
+}
+```
+```json
+{              <-- Second record
+  “a”: 2,
+  “b”: {
+    “b2”: 4    <-- note “b1” is NULL in this record
+  },
+  “c”: {       <-- note “c” was NULL in the first record
+    “c1”: 6        but when “c” is provided, c1 is also always provided
+  },
+  “d”: {
+    “d1”: 2,
+    “d2”: 1
+  }
+}
+```
+```json
+{              <-- Third record
+  “b”: {
+    “b1”: 5,
+    “b2”: 6
+  },
+  “c”: {
+    “c1”: 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: “a”, nullable: true, datatype: Int32)
+Field(name: “b”, nullable: false, datatype: Struct[
+  Field(name: “b1”, nullable: true, datatype: Int32),
+  Field(name: “b2”, nullable: false, datatype: Int32)
+])
+Field(name: “c”), nullable: true, datatype: Struct[
+  Field(name: “c1”, nullable: false, datatype: Int32)
+])
+Field(name: “d”), nullable: true, datatype: Struct[
+  Field(name: “d1”, nullable: false, datatype: Int32)
+  Field(name: “d2”, nullable: true, datatype: Int32)
+])
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  5  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐      ┌──────────────────────┐┌──────────────────┐ │
+│ │ ┌─────┐    ┌─────┐   │  │ │ │  ┌─────┐   ┌─────┐   ││ ┌─────┐   ┌─────┐│
+  │ │  0  │    │ ??  │   │      │  │  1  │   │  1  │   ││ │  0  │   │ ??  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  6  │   │      │  │  1  │   │  2  │   ││ │  1  │   │  1  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  7  │   │      │  │  0  │   │ ??  │   ││ │ ??  │   │ ??  ││ │
+│ │ └─────┘    └─────┘   │  │ │ │  └─────┘   └─────┘   ││ └─────┘   └─────┘│
+  │ Validity    Values   │      │  Validity   Values   ││ Validity   Values│ │
+│ │                      │  │ │ │                      ││                  │
+  │            "c.c1"    │      │            "d.d1"    ││ "d.d2"           │ │
+│ │            Primitive │  │ │ │            Primitive ││ PrimitiveArray   │
+  │            Array     │      │            Array     ││                  │ │
+│ └──────────────────────┘  │ │ └──────────────────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion; instead it only stores definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of d.d2, which contains two nullable levels, d and d2.
+
+A definition level of 0 would imply a null at the level of d:
+
+```json
+{
+}
+```
+
+A definition level of 1 would imply a null at the level of d.d2
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of 2 would imply a defined value for d.d2:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+Thus the parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  1  │    │  2  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  0  │             │ │  │  5  │  │ │

Review Comment:
   ```suggestion
    │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
    │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
    │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
    │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
   ```



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,345 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second of a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column: a column that contains one or more other columns.
+
+For example, consider the following three JSON documents:
+
+```json
+{              <-- First record
+  “a”: 1,      <-- the top level fields are a, b, c, and d
+  “b”: {
+    “b1”: 1,   <-- b1 and b2 are “nested” fields of “b”
+    “b2”: 3    <-- b2 is always provided (not null)
+   },
+ “d”: {
+   “d1”:  1    <-- d1 is a “nested” field of “d”
+  }
+}
+```
+```json
+{              <-- Second record
+  “a”: 2,
+  “b”: {
+    “b2”: 4    <-- note “b1” is NULL in this record
+  },
+  “c”: {       <-- note “c” was NULL in the first record
+    “c1”: 6        but when “c” is provided, c1 is also always provided
+  },
+  “d”: {
+    “d1”: 2,
+    “d2”: 1
+  }
+}
+```
+```json
+{              <-- Third record
+  “b”: {
+    “b1”: 5,
+    “b2”: 6
+  },
+  “c”: {
+    “c1”: 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: “a”, nullable: true, datatype: Int32)
+Field(name: “b”, nullable: false, datatype: Struct[
+  Field(name: “b1”, nullable: true, datatype: Int32),
+  Field(name: “b2”, nullable: false, datatype: Int32)
+])
+Field(name: “c”), nullable: true, datatype: Struct[
+  Field(name: “c1”, nullable: false, datatype: Int32)
+])
+Field(name: “d”), nullable: true, datatype: Struct[
+  Field(name: “d1”, nullable: false, datatype: Int32)
+  Field(name: “d2”, nullable: true, datatype: Int32)
+])
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  5  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐      ┌──────────────────────┐┌──────────────────┐ │
+│ │ ┌─────┐    ┌─────┐   │  │ │ │  ┌─────┐   ┌─────┐   ││ ┌─────┐   ┌─────┐│
+  │ │  0  │    │ ??  │   │      │  │  1  │   │  1  │   ││ │  0  │   │ ??  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  6  │   │      │  │  1  │   │  2  │   ││ │  1  │   │  1  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  7  │   │      │  │  0  │   │ ??  │   ││ │ ??  │   │ ??  ││ │
+│ │ └─────┘    └─────┘   │  │ │ │  └─────┘   └─────┘   ││ └─────┘   └─────┘│
+  │ Validity    Values   │      │  Validity   Values   ││ Validity   Values│ │
+│ │                      │  │ │ │                      ││                  │
+  │            "c.c1"    │      │            "d.d1"    ││ "d.d2"           │ │
+│ │            Primitive │  │ │ │            Primitive ││ PrimitiveArray   │
+  │            Array     │      │            Array     ││                  │ │
+│ └──────────────────────┘  │ │ └──────────────────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion; instead it only stores definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of d.d2, which contains two nullable levels, d and d2.
+
+A definition level of 0 would imply a null at the level of d:
+
+```json
+{
+}
+```
+
+A definition level of 1 would imply a null at the level of d.d2
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of 2 would imply a defined value for d.d2:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+Thus the parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  1  │    │  2  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  0  │             │ │  │  5  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐ │   ┌──────────────────────┐ ┌──────────────────────┐  │
+│ │  ┌─────┐    ┌─────┐  │   │ │  ┌─────┐    ┌─────┐  │ │ ┌─────┐     ┌─────┐  │
+  │  │  0  │    │  6  │  │ │   │  │  1  │    │  1  │  │ │ │  1  │     │  1  │  │  │
+│ │  ├─────┤    ├─────┤  │   │ │  ├─────┤    ├─────┤  │ │ ├─────┤     └─────┘  │
+  │  │  1  │    │  7  │  │ │   │  │  1  │    │  2  │  │ │ │  2  │              │  │
+│ │  ├─────┤    └─────┘  │   │ │  ├─────┤    └─────┘  │ │ ├─────┤              │
+  │  │  1  │             │ │   │  │  0  │             │ │ │  0  │              │  │
+│ │  └─────┘             │   │ │  └─────┘             │ │ └─────┘              │
+  │                      │ │   │                      │ │                      │  │
+│ │  Definition   Data   │   │ │  Definition   Data   │ │ Definition   Data    │
+  │    Levels            │ │   │    Levels            │ │   Levels             │  │
+│ │                      │   │ │                      │ │                      │
+  │  "c.1"               │ │   │  "d.1"               │ │  "d.d2"              │  │
+│ └──────────────────────┘   │ └──────────────────────┘ └──────────────────────┘
+     "c"                   │      "d"                                             │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+```
+
+## List / Repeated Columns

Review Comment:
   We should probably include an example of an empty slice



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,345 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second of a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column: a column that contains one or more other columns.
+
+For example, consider the following three JSON documents:
+
+```json
+{              <-- First record
+  “a”: 1,      <-- the top level fields are a, b, c, and d
+  “b”: {
+    “b1”: 1,   <-- b1 and b2 are “nested” fields of “b”
+    “b2”: 3    <-- b2 is always provided (not null)
+   },
+ “d”: {
+   “d1”:  1    <-- d1 is a “nested” field of “d”
+  }
+}
+```
+```json
+{              <-- Second record
+  “a”: 2,
+  “b”: {
+    “b2”: 4    <-- note “b1” is NULL in this record
+  },
+  “c”: {       <-- note “c” was NULL in the first record
+    “c1”: 6        but when “c” is provided, c1 is also always provided
+  },
+  “d”: {
+    “d1”: 2,
+    “d2”: 1
+  }
+}
+```
+```json
+{              <-- Third record
+  “b”: {
+    “b1”: 5,
+    “b2”: 6
+  },
+  “c”: {
+    “c1”: 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: “a”, nullable: true, datatype: Int32)
+Field(name: “b”, nullable: false, datatype: Struct[
+  Field(name: “b1”, nullable: true, datatype: Int32),
+  Field(name: “b2”, nullable: false, datatype: Int32)
+])
+Field(name: “c”), nullable: true, datatype: Struct[
+  Field(name: “c1”, nullable: false, datatype: Int32)
+])
+Field(name: “d”), nullable: true, datatype: Struct[
+  Field(name: “d1”, nullable: false, datatype: Int32)
+  Field(name: “d2”, nullable: true, datatype: Int32)
+])
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  5  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐      ┌──────────────────────┐┌──────────────────┐ │
+│ │ ┌─────┐    ┌─────┐   │  │ │ │  ┌─────┐   ┌─────┐   ││ ┌─────┐   ┌─────┐│
+  │ │  0  │    │ ??  │   │      │  │  1  │   │  1  │   ││ │  0  │   │ ??  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  6  │   │      │  │  1  │   │  2  │   ││ │  1  │   │  1  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  7  │   │      │  │  0  │   │ ??  │   ││ │ ??  │   │ ??  ││ │
+│ │ └─────┘    └─────┘   │  │ │ │  └─────┘   └─────┘   ││ └─────┘   └─────┘│
+  │ Validity    Values   │      │  Validity   Values   ││ Validity   Values│ │
+│ │                      │  │ │ │                      ││                  │
+  │            "c.c1"    │      │            "d.d1"    ││ "d.d2"           │ │
+│ │            Primitive │  │ │ │            Primitive ││ PrimitiveArray   │
+  │            Array     │      │            Array     ││                  │ │
+│ └──────────────────────┘  │ │ └──────────────────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
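+
+To make the picture above concrete, here is a rough sketch of how the child arrays and the `StructArray` for column `b` could be assembled with the Rust [arrow](https://github.com/apache/arrow-rs) crate. Constructor signatures have shifted between arrow releases, so treat the exact calls as illustrative rather than definitive:
+
+```rust
+use std::sync::Arc;
+use arrow::array::{Array, ArrayRef, Int32Array, StructArray};
+use arrow::datatypes::{DataType, Field};
+
+fn main() {
+    // "b1" is nullable (missing in the second record), "b2" is not
+    let b1: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(5)]));
+    let b2: ArrayRef = Arc::new(Int32Array::from(vec![3, 4, 6]));
+
+    // Assemble the non-nullable struct column "b" from its children
+    let b = StructArray::from(vec![
+        (Field::new("b1", DataType::Int32, true), b1),
+        (Field::new("b2", DataType::Int32, false), b2),
+    ]);
+    assert_eq!(b.len(), 3);
+}
+```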
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion. Instead it stores only *definition levels* for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of `d.d2`, which has two nullable levels: `d` and `d2`.
+
+A definition level of 0 would imply a null at the level of d:
+
+```json
+{
+}
+```
+
+A definition level of 1 would imply a null at the level of d.d2:
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of 2 would imply a defined value for d.d2:
+
+```json
+{
+  d: { d2: .. }
+}
+```
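+
+As a small illustration (plain Rust for exposition, not the actual API of the parquet crate), the definition level for the two-level nullable path `d` / `d.d2` can be thought of as:
+
+```rust
+/// Definition level for a value of "d.d2", where both "d" and "d2" are
+/// nullable, so the maximum definition level is 2
+fn def_level(d: Option<Option<i32>>) -> i16 {
+    match d {
+        None => 0,           // "d" itself is null
+        Some(None) => 1,     // "d" is present but "d2" is null
+        Some(Some(_)) => 2,  // both "d" and "d2" are defined
+    }
+}
+
+fn main() {
+    assert_eq!(def_level(None), 0);
+    assert_eq!(def_level(Some(None)), 1);
+    assert_eq!(def_level(Some(Some(1))), 2);
+}
+```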
+
+Thus the Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐ │   ┌──────────────────────┐ ┌──────────────────────┐  │
+│ │  ┌─────┐    ┌─────┐  │   │ │  ┌─────┐    ┌─────┐  │ │ ┌─────┐     ┌─────┐  │
+  │  │  0  │    │  6  │  │ │   │  │  1  │    │  1  │  │ │ │  1  │     │  1  │  │  │
+│ │  ├─────┤    ├─────┤  │   │ │  ├─────┤    ├─────┤  │ │ ├─────┤     └─────┘  │
+  │  │  1  │    │  7  │  │ │   │  │  1  │    │  2  │  │ │ │  2  │              │  │
+│ │  ├─────┤    └─────┘  │   │ │  ├─────┤    └─────┘  │ │ ├─────┤              │
+  │  │  1  │             │ │   │  │  0  │             │ │ │  0  │              │  │
+│ │  └─────┘             │   │ │  └─────┘             │ │ └─────┘              │
+  │                      │ │   │                      │ │                      │  │
+│ │  Definition   Data   │   │ │  Definition   Data   │ │ Definition   Data    │
+  │    Levels            │ │   │    Levels            │ │   Levels             │  │
+│ │                      │   │ │                      │ │                      │
+  │  "c.1"               │ │   │  "d.1"               │ │  "d.d2"              │  │
+│ └──────────────────────┘   │ └──────────────────────┘ └──────────────────────┘
+     "c"                   │      "d"                                             │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+```
+
+## List / Repeated Columns
+
+Closing out support for nested types are columns that contain a variable number of values. For example,
+
+```json
+{                     <-- First record
+  "a": [1]            <-- top-level field a containing a list of integers
+}
+```
+```json
+{                     <-- "a" is not provided
+}
+```
+```json
+{
+  "a": [null, 2]      <-- list elements of a are nullable
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32)
+))
+```
+
+Documents of this format could be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion with a list of monotonically increasing integers called *offsets* in the parent `ListArray`, and stores all the values that appear in the lists in a single child array. Each consecutive pair of elements in this offset array identifies a slice of the child array for that array index.
+
+For example, the list of offsets `[0, 1, 1, 3]` contains 3 pairs of offsets, `(0,1)`, `(1,1)`, and `(1,3)`, and therefore encodes a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0]]
+1: []
+2: [child[1], child[2]]
+```
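+
+A short standalone sketch (plain Rust, purely for illustration) of how consecutive offset pairs slice the child values:
+
+```rust
+fn main() {
+    let offsets = [0usize, 1, 1, 3];
+    let child = [10, 20, 30]; // arbitrary child values
+
+    // Each consecutive pair of offsets is one list in the ListArray
+    for (i, pair) in offsets.windows(2).enumerate() {
+        println!("{}: {:?}", i, &child[pair[0]..pair[1]]);
+    }
+    // Prints:
+    // 0: [10]
+    // 1: []
+    // 2: [20, 30]
+}
+```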
+
+For the example above, this would be encoded in Arrow as:
+
+```text
+a: ListArray
+  Offsets: [0, 1, 1, 3]
+  Validity: [true, false, true]
+  Children:
+    element: PrimitiveArray
+      Buffer[0]: [1, ARBITRARY, 2]
+      Validity: [true, false, true]
+```
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    └─────┘   ├─────┤    │ └─────┘   └─────┘│
+     Validity  │  3  │    │ Validity   Values│ │
+│              └─────┘    │                  │
+                          │ child[0]         │ │
+│                         │ PrimitiveArray   │
+               Offsets    │                  │ │
+│                         └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
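+
+For reference, the same `ListArray` can be built with the Rust arrow crate roughly as follows (again a sketch; helpers such as `from_iter_primitive` may differ between arrow versions):
+
+```rust
+use arrow::array::{Array, ListArray};
+use arrow::datatypes::Int32Type;
+
+fn main() {
+    // [ [1], null, [null, 2] ]
+    let rows = vec![
+        Some(vec![Some(1)]),
+        None,
+        Some(vec![None, Some(2)]),
+    ];
+    let a = ListArray::from_iter_primitive::<Int32Type, _, _>(rows);
+
+    assert_eq!(a.len(), 3);
+    assert!(a.is_null(1));
+    assert_eq!(a.value_offsets(), &[0, 1, 1, 3][..]);
+}
+```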
+
+### Repetition Levels
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of 0 would imply a new list in the top-most repeated field, a value of 1 a new element within the top-most repeated field, a value of 2 a new element within the second top-most repeated field, and so on.
+
+Each repeated field also has a corresponding definition level; however, in this case rather than indicating a null value, it indicates an empty array.
+
+
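+Putting the two level types together, here is a small standalone sketch (plain Rust, not the parquet crate's reader) of how repetition and definition levels for column `a` can be decoded back into nested lists. It uses definition levels `[3, 0, 2, 3]`, the corrected values noted in the review comment below:
+
+```rust
+fn main() {
+    let rep = [0, 0, 0, 1];
+    let def = [3, 0, 2, 3];
+    let mut values = [1, 2].into_iter();
+
+    // def level meanings for this schema:
+    // 0 => "a" is null, 1 => "a" is an empty list,
+    // 2 => list element is null, 3 => list element is defined
+    let mut rows: Vec<Option<Vec<Option<i32>>>> = Vec::new();
+    for (&r, &d) in rep.iter().zip(&def) {
+        if r == 0 {
+            // repetition level 0 starts a new row
+            rows.push(if d == 0 { None } else { Some(vec![]) });
+        }
+        if d >= 2 {
+            // levels >= 2 describe an element of the current list
+            let list = rows.last_mut().unwrap().as_mut().unwrap();
+            list.push((d == 3).then(|| values.next().unwrap()));
+        }
+    }
+    assert_eq!(
+        rows,
+        vec![Some(vec![Some(1)]), None, Some(vec![None, Some(2)])]
+    );
+}
+```
+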
+```text
+a:
+  Data Page:
+    Repetition Levels: encode([0, 0, 0, 1])
+    Definition Levels: encode([3, 0, 2, 2])

Review Comment:
   Yeah it isn't, it should be `[3, 0, 2, 3]`



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
+```text
+a:
+  Data Page:
+    Repetition Levels: encode([0, 0, 0, 1])
+    Definition Levels: encode([3, 0, 2, 2])
+    Values: encode([1, 2])
+```
+
+```text
+┌────────────────────────────────────────┐
+│  ┌─────┐       ┌─────┐                 │
+│  │  1  │       │  3  │                 │
+│  ├─────┤       ├─────┤                 │
+│  │  0  │       │  0  │                 │
+│  ├─────┤       ├─────┤        ┌─────┐  │
+│  │  1  │       │  2  │        │  1  │  │
+│  ├─────┤       ├─────┤        ├─────┤  │
+│  │  0  │       │  2  │        │  2  │  │
+│  └─────┘       └─────┘        └─────┘  │
+│  Definition   Repetition       Data    │
+│    Levels       Levels                 │
+│    "a"                                 │
+└────────────────────────────────────────┘

Review Comment:
   ```suggestion
   ┌────────────────────────────────────────┐
   │  ┌─────┐       ┌─────┐                 │
   │  │  3  │       │  0  │                 │
   │  ├─────┤       ├─────┤                 │
   │  │  0  │       │  0  │                 │
   │  ├─────┤       ├─────┤        ┌─────┐  │
   │  │  2  │       │  0  │        │  1  │  │
   │  ├─────┤       ├─────┤        ├─────┤  │
   │  │  3  │       │  1  │        │  2  │  │
   │  └─────┘       └─────┘        └─────┘  │
   │  Definition   Repetition       Data    │
   │    Levels       Levels                 │
   │    "a"                                 │
   └────────────────────────────────────────┘
   ```



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
+For the example, above this would be encoded in arrow as
+
+```text
+a: ListArray
+  Offsets: [0, 1, 1, 3]
+  Validity: [true, false, true]
+  Children:
+    element: PrimitiveArray
+      Buffer[0]: [1, 2, ARBITRARY]
+      Validity: [true, true, false]
+```

Review Comment:
   ```suggestion
   ```
   The diagram is the same information better presented imo



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
+For the example, above this would be encoded in arrow as

Review Comment:
   ```suggestion
   For the example above this would be encoded in arrow as
   ```



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,345 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column, this is a column that contains one or more other columns.
+
+For example consider the following three JSON documents
+
+```json
+{              <-- First record
+  “a”: 1,      <-- the top level fields are a, b, c, and d
+  “b”: {
+    “b1”: 1,   <-- b1 and b2 are “nested” fields of “b”
+    “b2”: 3    <-- b2 is always provided (not null)
+   },
+ “d”: {
+   “d1”:  1    <-- d1 is a “nested” field of “d”
+  }
+}
+```
+```json
+{              <-- Second record
+  “a”: 2,
+  “b”: {
+    “b2”: 4    <-- note “b1” is NULL in this record
+  },
+  “c”: {       <-- note “c” was NULL in the first record
+    “c1”: 6        but when “c” is provided, c1 is also always provided
+  },
+  “d”: {
+    “d1”: 2,
+    “d2”: 1
+  }
+}
+```
+```json
+{              <-- Third record
+  “b”: {
+    “b1”: 5,
+    “b2”: 6
+  },
+  “c”: {
+    “c1”: 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: “a”, nullable: true, datatype: Int32)
+Field(name: “b”, nullable: false, datatype: Struct[
+  Field(name: “b1”, nullable: true, datatype: Int32),
+  Field(name: “b2”, nullable: false, datatype: Int32)
+])
+Field(name: “c”), nullable: true, datatype: Struct[
+  Field(name: “c1”, nullable: false, datatype: Int32)
+])
+Field(name: “d”), nullable: true, datatype: Struct[
+  Field(name: “d1”, nullable: false, datatype: Int32)
+  Field(name: “d2”, nullable: true, datatype: Int32)
+])
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  5  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐      ┌──────────────────────┐┌──────────────────┐ │
+│ │ ┌─────┐    ┌─────┐   │  │ │ │  ┌─────┐   ┌─────┐   ││ ┌─────┐   ┌─────┐│
+  │ │  0  │    │ ??  │   │      │  │  1  │   │  1  │   ││ │  0  │   │ ??  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  6  │   │      │  │  1  │   │  2  │   ││ │  1  │   │  1  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  7  │   │      │  │  0  │   │ ??  │   ││ │ ??  │   │ ??  ││ │
+│ │ └─────┘    └─────┘   │  │ │ │  └─────┘   └─────┘   ││ └─────┘   └─────┘│
+  │ Validity    Values   │      │  Validity   Values   ││ Validity   Values│ │
+│ │                      │  │ │ │                      ││                  │
+  │            "c.c1"    │      │            "d.d1"    ││ "d.d2"           │ │
+│ │            Primitive │  │ │ │            Primitive ││ PrimitiveArray   │
+  │            Array     │      │            Array     ││                  │ │
+│ └──────────────────────┘  │ │ └──────────────────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element, is the depth in the schema at which it is fully defined.
+
+For example consider the case of d.d2, which contains two nullable levels d and d2.
+
+A definition level of 0 would imply a null at the level of d:
+
+```json
+{
+}
+```
+
+A definition level of 1 would imply a null at the level of d.d2
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of 2 would imply a defined value for d.d2:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+Thus the parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  1  │    │  2  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  0  │             │ │  │  5  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐ │   ┌──────────────────────┐ ┌──────────────────────┐  │
+│ │  ┌─────┐    ┌─────┐  │   │ │  ┌─────┐    ┌─────┐  │ │ ┌─────┐     ┌─────┐  │
+  │  │  0  │    │  6  │  │ │   │  │  1  │    │  1  │  │ │ │  1  │     │  1  │  │  │
+│ │  ├─────┤    ├─────┤  │   │ │  ├─────┤    ├─────┤  │ │ ├─────┤     └─────┘  │
+  │  │  1  │    │  7  │  │ │   │  │  1  │    │  2  │  │ │ │  2  │              │  │
+│ │  ├─────┤    └─────┘  │   │ │  ├─────┤    └─────┘  │ │ ├─────┤              │
+  │  │  1  │             │ │   │  │  0  │             │ │ │  0  │              │  │
+│ │  └─────┘             │   │ │  └─────┘             │ │ └─────┘              │
+  │                      │ │   │                      │ │                      │  │
+│ │  Definition   Data   │   │ │  Definition   Data   │ │ Definition   Data    │
+  │    Levels            │ │   │    Levels            │ │   Levels             │  │
+│ │                      │   │ │                      │ │                      │
+  │  "c.1"               │ │   │  "d.1"               │ │  "d.d2"              │  │
+│ └──────────────────────┘   │ └──────────────────────┘ └──────────────────────┘
+     "c"                   │      "d"                                             │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+```
+
+## List / Repeated Columns
+
+Closing out support for nested types is columns containing a variable number of values. For example,
+
+```json
+{                     <-- First record
+  “a”: [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- “a” is not provided
+}
+```
+```json
+{
+  “a”: [null, 2],  <-- list elements of a are nullable
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: “a”, nullable: true, datatype: List(
+  Field(name: “element”, nullable: true, datatype: Int32),
+)
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion with a list of monotonically increasing integers called *offsets* in the parent `ListArray`, and stores all the values that appear in the lists in a single child array. Each consecutive pair of elements in this offset array identifies a slice of the child array for that array index.
+
+For example, the list of offsets `[0, 1, 1, 3]` contains 3 pairs of offsets, `(0,1)`, `(1,1)`, and `(1,3)`, and is therefore a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0]]
+1: []
+2: [child[1], child[2]]
+```
+
+For the example above, this would be encoded in Arrow as
+
+```text
+a: ListArray
+  Offsets: [0, 1, 1, 3]
+  Validity: [true, false, true]
+  Children:
+    element: PrimitiveArray
+      Buffer[0]: [1, 2, ARBITRARY]
+      Validity: [true, true, false]
+```
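+
+As a rough sketch with the Rust `arrow` crate (illustrative only; exact APIs vary between arrow versions), the same layout can be built with `ListArray::from_iter_primitive` and inspected directly:
+
+```rust
+use arrow::array::{Array, Int32Array, ListArray};
+use arrow::datatypes::Int32Type;
+
+fn main() {
+    // The three documents: {"a": [1]}, {}, {"a": [null, 2]}
+    let a = ListArray::from_iter_primitive::<Int32Type, _, _>(vec![
+        Some(vec![Some(1)]),
+        None,
+        Some(vec![None, Some(2)]),
+    ]);
+
+    assert_eq!(a.value_offsets(), &[0, 1, 1, 3]); // the offsets described above
+    assert!(!a.is_valid(1));                      // the second list is null
+
+    // Each consecutive offset pair slices the child array; row 2 is child[1..3]
+    let row2 = a.value(2);
+    let row2 = row2.as_any().downcast_ref::<Int32Array>().unwrap();
+    assert_eq!(row2.len(), 2);
+}
+```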
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  0  │   │ ??  ││ │

Review Comment:
   ```suggestion
        │  0  │   │  1  │    │ │  0  │   │  ??  ││ │
   │    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
        │  1  │   │  1  │    │ │  1  │   │ 2  ││ │
   ```



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,345 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column, which is a column that contains one or more other columns.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  “a”: 1,      <-- the top level fields are a, b, c, and d
+  “b”: {
+    “b1”: 1,   <-- b1 and b2 are “nested” fields of “b”
+    “b2”: 3    <-- b2 is always provided (not null)
+   },
+ “d”: {
+   “d1”:  1    <-- d1 is a “nested” field of “d”
+  }
+}
+```
+```json
+{              <-- Second record
+  “a”: 2,
+  “b”: {
+    “b2”: 4    <-- note “b1” is NULL in this record
+  },
+  “c”: {       <-- note “c” was NULL in the first record
+    “c1”: 6        but when “c” is provided, c1 is also always provided
+  },
+  “d”: {
+    “d1”: 2,
+    “d2”: 1
+  }
+}
+```
+```json
+{              <-- Third record
+  “b”: {
+    “b1”: 5,
+    “b2”: 6
+  },
+  “c”: {
+    “c1”: 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: “a”, nullable: true, datatype: Int32)
+Field(name: “b”, nullable: false, datatype: Struct[
+  Field(name: “b1”, nullable: true, datatype: Int32),
+  Field(name: “b2”, nullable: false, datatype: Int32)
+])
+Field(name: “c”), nullable: true, datatype: Struct[
+  Field(name: “c1”, nullable: false, datatype: Int32)
+])
+Field(name: “d”), nullable: true, datatype: Struct[
+  Field(name: “d1”, nullable: false, datatype: Int32)
+  Field(name: “d2”, nullable: true, datatype: Int32)
+])
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Arrow represents each `StructArray` hierarchically using a parent-child relationship, with separate validity masks on each of the individual nullable arrays.
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  5  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐      ┌──────────────────────┐┌──────────────────┐ │
+│ │ ┌─────┐    ┌─────┐   │  │ │ │  ┌─────┐   ┌─────┐   ││ ┌─────┐   ┌─────┐│
+  │ │  0  │    │ ??  │   │      │  │  1  │   │  1  │   ││ │  0  │   │ ??  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  6  │   │      │  │  1  │   │  2  │   ││ │  1  │   │  1  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  7  │   │      │  │  0  │   │ ??  │   ││ │ ??  │   │ ??  ││ │
+│ │ └─────┘    └─────┘   │  │ │ │  └─────┘   └─────┘   ││ └─────┘   └─────┘│
+  │ Validity    Values   │      │  Validity   Values   ││ Validity   Values│ │
+│ │                      │  │ │ │                      ││                  │
+  │            "c.c1"    │      │            "d.d1"    ││ "d.d2"           │ │
+│ │            Primitive │  │ │ │            Primitive ││ PrimitiveArray   │
+  │            Array     │      │            Array     ││                  │ │
+│ └──────────────────────┘  │ │ └──────────────────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
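+
+As a rough sketch with the Rust `arrow` crate (illustrative only; on newer arrow versions the field tuples are `Arc`-wrapped), the "a" and "b" columns above could be assembled like this, with each nullable child carrying its own validity mask:
+
+```rust
+use std::sync::Arc;
+
+use arrow::array::{Array, ArrayRef, Int32Array, StructArray};
+use arrow::datatypes::{DataType, Field};
+
+fn main() {
+    // "a": a nullable primitive column (the third record has no "a")
+    let a = Int32Array::from(vec![Some(1), Some(2), None]);
+
+    // "b": a non-nullable struct whose children keep their own validity
+    let b1: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(5)]));
+    let b2: ArrayRef = Arc::new(Int32Array::from(vec![3, 4, 6]));
+    let b = StructArray::from(vec![
+        (Field::new("b1", DataType::Int32, true), b1),
+        (Field::new("b2", DataType::Int32, false), b2),
+    ]);
+
+    assert_eq!(a.null_count(), 1);
+    assert_eq!(b.column(0).null_count(), 1); // only "b.b1" contains a null
+    assert_eq!(b.column(1).null_count(), 0); // "b.b2" is always set
+}
+```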
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of d.d2, which contains two nullable levels, d and d2.
+
+A definition level of 0 would imply a null at the level of d:
+
+```json
+{
+}
+```
+
+A definition level of 1 would imply a null at the level of d.d2:
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of 2 would imply a defined value for d.d2:
+
+```json
+{
+  d: { d2: .. }
+}
+```
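+
+One way to picture this mapping (an illustrative sketch only, not how the parquet crate implements it) is as a function from the two nullable levels to a level number:
+
+```rust
+/// Definition level for the "d.d2" column: the outer Option is "d",
+/// the inner Option is "d2".
+fn def_level(d: Option<Option<i32>>) -> i16 {
+    match d {
+        None => 0,          // "d" itself is null
+        Some(None) => 1,    // "d" is present but "d2" is null
+        Some(Some(_)) => 2, // both levels are defined
+    }
+}
+
+fn main() {
+    // The three example records: d = {d1: 1}, d = {d1: 2, d2: 1}, d missing
+    assert_eq!(def_level(Some(None)), 1);
+    assert_eq!(def_level(Some(Some(1))), 2);
+    assert_eq!(def_level(None), 0);
+}
+```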
+
+Thus the parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  1  │    │  2  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  0  │             │ │  │  5  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐ │   ┌──────────────────────┐ ┌──────────────────────┐  │
+│ │  ┌─────┐    ┌─────┐  │   │ │  ┌─────┐    ┌─────┐  │ │ ┌─────┐     ┌─────┐  │
+  │  │  0  │    │  6  │  │ │   │  │  1  │    │  1  │  │ │ │  1  │     │  1  │  │  │
+│ │  ├─────┤    ├─────┤  │   │ │  ├─────┤    ├─────┤  │ │ ├─────┤     └─────┘  │
+  │  │  1  │    │  7  │  │ │   │  │  1  │    │  2  │  │ │ │  2  │              │  │
+│ │  ├─────┤    └─────┘  │   │ │  ├─────┤    └─────┘  │ │ ├─────┤              │
+  │  │  1  │             │ │   │  │  0  │             │ │ │  0  │              │  │
+│ │  └─────┘             │   │ │  └─────┘             │ │ └─────┘              │
+  │                      │ │   │                      │ │                      │  │
+│ │  Definition   Data   │   │ │  Definition   Data   │ │ Definition   Data    │
+  │    Levels            │ │   │    Levels            │ │   Levels             │  │
+│ │                      │   │ │                      │ │                      │
+  │  "c.1"               │ │   │  "d.1"               │ │  "d.d2"              │  │
+│ └──────────────────────┘   │ └──────────────────────┘ └──────────────────────┘
+     "c"                   │      "d"                                             │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+```
+
+## List / Repeated Columns
+
+Closing out support for nested types are columns that contain a variable number of values. For example,
+
+```json
+{                     <-- First record
+  “a”: [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- “a” is not provided
+}
+```
+```json
+{
+  “a”: [null, 2],  <-- list elements of a are nullable
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: “a”, nullable: true, datatype: List(
+  Field(name: “element”, nullable: true, datatype: Int32),
+)
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion with a list of monotonically increasing integers called *offsets* in the parent `ListArray`, and stores all the values that appear in the lists in a single child array. Each consecutive pair of elements in this offset array identifies a slice of the child array for that array index.
+
+For example, the list of  offsets [0, 1, 1, 3] contains 3 pairs of offsets, (0,1), (1,1), and (1,3), and is therefore a ListArray of length 3 with the following values:

Review Comment:
   ```suggestion
   For example, the list of offsets `[0, 1, 1, 3]` contains 3 pairs of offsets, `(0,1)`, `(1,1)`, and `(1,3)`, and is therefore a ListArray of length 3 with the following values:
   ```



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,345 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column, which is a column that contains one or more other columns.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  “a”: 1,      <-- the top level fields are a, b, c, and d
+  “b”: {
+    “b1”: 1,   <-- b1 and b2 are “nested” fields of “b”
+    “b2”: 3    <-- b2 is always provided (not null)
+   },
+ “d”: {
+   “d1”:  1    <-- d1 is a “nested” field of “d”
+  }
+}
+```
+```json
+{              <-- Second record
+  “a”: 2,
+  “b”: {
+    “b2”: 4    <-- note “b1” is NULL in this record
+  },
+  “c”: {       <-- note “c” was NULL in the first record
+    “c1”: 6        but when “c” is provided, c1 is also always provided
+  },
+  “d”: {
+    “d1”: 2,
+    “d2”: 1
+  }
+}
+```
+```json
+{              <-- Third record
+  “b”: {
+    “b1”: 5,
+    “b2”: 6
+  },
+  “c”: {
+    “c1”: 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: “a”, nullable: true, datatype: Int32)
+Field(name: “b”, nullable: false, datatype: Struct[
+  Field(name: “b1”, nullable: true, datatype: Int32),
+  Field(name: “b2”, nullable: false, datatype: Int32)
+])
+Field(name: “c”), nullable: true, datatype: Struct[
+  Field(name: “c1”, nullable: false, datatype: Int32)
+])
+Field(name: “d”), nullable: true, datatype: Struct[
+  Field(name: “d1”, nullable: false, datatype: Int32)
+  Field(name: “d2”, nullable: true, datatype: Int32)
+])
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
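+
+In practice this Parquet schema is usually not written by hand: when converting, the writer derives it from the Arrow schema. A rough sketch with the Rust `arrow` and `parquet` crates (illustrative only; the file path is hypothetical and exact APIs vary between versions) of writing such a batch and reading it back:
+
+```rust
+use std::fs::File;
+use std::sync::Arc;
+
+use arrow::array::{Array, ArrayRef, Int32Array, StructArray};
+use arrow::datatypes::{DataType, Field, Schema};
+use arrow::record_batch::RecordBatch;
+use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
+use parquet::arrow::ArrowWriter;
+
+fn main() -> Result<(), Box<dyn std::error::Error>> {
+    // Columns "a" and "b" from the running example
+    let a: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), Some(2), None]));
+    let b1: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(5)]));
+    let b2: ArrayRef = Arc::new(Int32Array::from(vec![3, 4, 6]));
+    let b: ArrayRef = Arc::new(StructArray::from(vec![
+        (Field::new("b1", DataType::Int32, true), b1),
+        (Field::new("b2", DataType::Int32, false), b2),
+    ]));
+
+    let schema = Arc::new(Schema::new(vec![
+        Field::new("a", DataType::Int32, true),
+        Field::new("b", b.data_type().clone(), false),
+    ]));
+    let batch = RecordBatch::try_new(schema.clone(), vec![a, b])?;
+
+    // Writing derives a Parquet group schema equivalent to the one above
+    let mut writer = ArrowWriter::try_new(File::create("/tmp/structs.parquet")?, schema, None)?;
+    writer.write(&batch)?;
+    writer.close()?;
+
+    // Reading converts the definition levels back into validity masks
+    let reader = ParquetRecordBatchReaderBuilder::try_new(File::open("/tmp/structs.parquet")?)?.build()?;
+    for batch in reader {
+        println!("read {} rows", batch?.num_rows());
+    }
+    Ok(())
+}
+```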
+
+Arrow represents each `StructArray` hierarchically using a parent-child relationship, with separate validity masks on each of the individual nullable arrays.
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  5  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐      ┌──────────────────────┐┌──────────────────┐ │
+│ │ ┌─────┐    ┌─────┐   │  │ │ │  ┌─────┐   ┌─────┐   ││ ┌─────┐   ┌─────┐│
+  │ │  0  │    │ ??  │   │      │  │  1  │   │  1  │   ││ │  0  │   │ ??  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  6  │   │      │  │  1  │   │  2  │   ││ │  1  │   │  1  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  7  │   │      │  │  0  │   │ ??  │   ││ │ ??  │   │ ??  ││ │
+│ │ └─────┘    └─────┘   │  │ │ │  └─────┘   └─────┘   ││ └─────┘   └─────┘│
+  │ Validity    Values   │      │  Validity   Values   ││ Validity   Values│ │
+│ │                      │  │ │ │                      ││                  │
+  │            "c.c1"    │      │            "d.d1"    ││ "d.d2"           │ │
+│ │            Primitive │  │ │ │            Primitive ││ PrimitiveArray   │
+  │            Array     │      │            Array     ││                  │ │
+│ └──────────────────────┘  │ │ └──────────────────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of d.d2, which contains two nullable levels, d and d2.
+
+A definition level of 0 would imply a null at the level of d:
+
+```json
+{
+}
+```
+
+A definition level of 1 would imply a null at the level of d.d2:
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of 2 would imply a defined value for d.d2:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+Thus the parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  1  │    │  2  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  0  │             │ │  │  5  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐ │   ┌──────────────────────┐ ┌──────────────────────┐  │
+│ │  ┌─────┐    ┌─────┐  │   │ │  ┌─────┐    ┌─────┐  │ │ ┌─────┐     ┌─────┐  │
+  │  │  0  │    │  6  │  │ │   │  │  1  │    │  1  │  │ │ │  1  │     │  1  │  │  │
+│ │  ├─────┤    ├─────┤  │   │ │  ├─────┤    ├─────┤  │ │ ├─────┤     └─────┘  │
+  │  │  1  │    │  7  │  │ │   │  │  1  │    │  2  │  │ │ │  2  │              │  │
+│ │  ├─────┤    └─────┘  │   │ │  ├─────┤    └─────┘  │ │ ├─────┤              │
+  │  │  1  │             │ │   │  │  0  │             │ │ │  0  │              │  │
+│ │  └─────┘             │   │ │  └─────┘             │ │ └─────┘              │
+  │                      │ │   │                      │ │                      │  │
+│ │  Definition   Data   │   │ │  Definition   Data   │ │ Definition   Data    │
+  │    Levels            │ │   │    Levels            │ │   Levels             │  │
+│ │                      │   │ │                      │ │                      │
+  │  "c.1"               │ │   │  "d.1"               │ │  "d.d2"              │  │
+│ └──────────────────────┘   │ └──────────────────────┘ └──────────────────────┘
+     "c"                   │      "d"                                             │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+```
+
+## List / Repeated Columns
+
+Closing out support for nested types are columns that contain a variable number of values. For example,
+
+```json
+{                     <-- First record
+  “a”: [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- “a” is not provided
+}
+```
+```json
+{
+  “a”: [null, 2],  <-- list elements of a are nullable
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: “a”, nullable: true, datatype: List(
+  Field(name: “element”, nullable: true, datatype: Int32),
+)
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion with a list of monotonically increasing integers called *offsets* in the parent `ListArray`, and stores all the values that appear in the lists in a single child array. Each consecutive pair of elements in this offset array identifies a slice of the child array for that array index.
+
+For example, the list of offsets `[0, 1, 1, 3]` contains 3 pairs of offsets, `(0,1)`, `(1,1)`, and `(1,3)`, and is therefore a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0]]
+1: []
+2: [child[1], child[2]]
+```
+
+For the example above, this would be encoded in Arrow as
+
+```text
+a: ListArray
+  Offsets: [0, 1, 1, 3]
+  Validity: [true, false, true]
+  Children:
+    element: PrimitiveArray
+      Buffer[0]: [1, 2, ARBITRARY]
+      Validity: [true, true, false]
+```
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    └─────┘   ├─────┤    │ └─────┘   └─────┘│
+     Validity  │  3  │    │ Validity   Values│ │
+│              └─────┘    │                  │
+                          │ child[0]         │ │
+│                         │ PrimitiveArray   │
+               Offsets    │                  │ │
+│                         └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+### Repetition Levels
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of 0 would imply a new list in the top-most repeated field, a value of 1 a new element within the top-most repeated field, a value of 2 a new element within the second top-most repeated field, and so on.
+
+Each repeated field also has a corresponding definition level; however, in this case, rather than indicating a null value, it indicates an empty array.
+
+
+```text
+a:
+  Data Page:
+    Repetition Levels: encode([0, 0, 0, 1])
+    Definition Levels: encode([3, 0, 2, 2])
+    Values: encode([1, 2])
+```

Review Comment:
   ```suggestion
   ```
   Same as above



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,345 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column, which is a column that contains one or more other columns.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  “a”: 1,      <-- the top level fields are a, b, c, and d
+  “b”: {
+    “b1”: 1,   <-- b1 and b2 are “nested” fields of “b”
+    “b2”: 3    <-- b2 is always provided (not null)
+   },
+ “d”: {
+   “d1”:  1    <-- d1 is a “nested” field of “d”
+  }
+}
+```
+```json
+{              <-- Second record
+  “a”: 2,
+  “b”: {
+    “b2”: 4    <-- note “b1” is NULL in this record
+  },
+  “c”: {       <-- note “c” was NULL in the first record
+    “c1”: 6        but when “c” is provided, c1 is also always provided
+  },
+  “d”: {
+    “d1”: 2,
+    “d2”: 1
+  }
+}
+```
+```json
+{              <-- Third record
+  “b”: {
+    “b1”: 5,
+    “b2”: 6
+  },
+  “c”: {
+    “c1”: 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: “a”, nullable: true, datatype: Int32)
+Field(name: “b”, nullable: false, datatype: Struct[
+  Field(name: “b1”, nullable: true, datatype: Int32),
+  Field(name: “b2”, nullable: false, datatype: Int32)
+])
+Field(name: “c”), nullable: true, datatype: Struct[
+  Field(name: “c1”, nullable: false, datatype: Int32)
+])
+Field(name: “d”), nullable: true, datatype: Struct[
+  Field(name: “d1”, nullable: false, datatype: Int32)
+  Field(name: “d2”, nullable: true, datatype: Int32)
+])
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Arrow represents each `StructArray` hierarchically using a parent-child relationship, with separate validity masks on each of the individual nullable arrays.
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  5  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐      ┌──────────────────────┐┌──────────────────┐ │
+│ │ ┌─────┐    ┌─────┐   │  │ │ │  ┌─────┐   ┌─────┐   ││ ┌─────┐   ┌─────┐│
+  │ │  0  │    │ ??  │   │      │  │  1  │   │  1  │   ││ │  0  │   │ ??  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  6  │   │      │  │  1  │   │  2  │   ││ │  1  │   │  1  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  7  │   │      │  │  0  │   │ ??  │   ││ │ ??  │   │ ??  ││ │
+│ │ └─────┘    └─────┘   │  │ │ │  └─────┘   └─────┘   ││ └─────┘   └─────┘│
+  │ Validity    Values   │      │  Validity   Values   ││ Validity   Values│ │
+│ │                      │  │ │ │                      ││                  │
+  │            "c.c1"    │      │            "d.d1"    ││ "d.d2"           │ │
+│ │            Primitive │  │ │ │            Primitive ││ PrimitiveArray   │
+  │            Array     │      │            Array     ││                  │ │
+│ └──────────────────────┘  │ │ └──────────────────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of d.d2, which contains two nullable levels, d and d2.
+
+A definition level of 0 would imply a null at the level of d:
+
+```json
+{
+}
+```
+
+A definition level of 1 would imply a null at the level of d.d2:
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of 2 would imply a defined value for d.d2:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+Thus the parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  1  │    │  2  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  0  │             │ │  │  5  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐ │   ┌──────────────────────┐ ┌──────────────────────┐  │
+│ │  ┌─────┐    ┌─────┐  │   │ │  ┌─────┐    ┌─────┐  │ │ ┌─────┐     ┌─────┐  │
+  │  │  0  │    │  6  │  │ │   │  │  1  │    │  1  │  │ │ │  1  │     │  1  │  │  │
+│ │  ├─────┤    ├─────┤  │   │ │  ├─────┤    ├─────┤  │ │ ├─────┤     └─────┘  │
+  │  │  1  │    │  7  │  │ │   │  │  1  │    │  2  │  │ │ │  2  │              │  │
+│ │  ├─────┤    └─────┘  │   │ │  ├─────┤    └─────┘  │ │ ├─────┤              │
+  │  │  1  │             │ │   │  │  0  │             │ │ │  0  │              │  │
+│ │  └─────┘             │   │ │  └─────┘             │ │ └─────┘              │
+  │                      │ │   │                      │ │                      │  │
+│ │  Definition   Data   │   │ │  Definition   Data   │ │ Definition   Data    │
+  │    Levels            │ │   │    Levels            │ │   Levels             │  │
+│ │                      │   │ │                      │ │                      │
+  │  "c.1"               │ │   │  "d.1"               │ │  "d.d2"              │  │
+│ └──────────────────────┘   │ └──────────────────────┘ └──────────────────────┘
+     "c"                   │      "d"                                             │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+```
+
+## List / Repeated Columns
+
+Closing out support for nested types are columns that contain a variable number of values. For example,
+
+```json
+{                     <-- First record
+  “a”: [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- “a” is not provided
+}
+```
+```json
+{
+  “a”: [null, 2],  <-- list elements of a are nullable
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: “a”, nullable: true, datatype: List(
+  Field(name: “element”, nullable: true, datatype: Int32),
+)
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion with a list of monotonically increasing integers called *offsets* in the parent `ListArray`, and stores all the values that appear in the lists in a single child array. Each consecutive pair of elements in this offset array identifies a slice of the child array for that array index.
+
+For example, the list of offsets `[0, 1, 1, 3]` contains 3 pairs of offsets, `(0,1)`, `(1,1)`, and `(1,3)`, and is therefore a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0]]
+1: []
+2: [child[1], child[2]]
+```
+
+For the example above, this would be encoded in Arrow as
+
+```text
+a: ListArray
+  Offsets: [0, 1, 1, 3]
+  Validity: [true, false, true]
+  Children:
+    element: PrimitiveArray
+      Buffer[0]: [1, 2, ARBITRARY]
+      Validity: [true, true, false]
+```
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    └─────┘   ├─────┤    │ └─────┘   └─────┘│
+     Validity  │  3  │    │ Validity   Values│ │
+│              └─────┘    │                  │
+                          │ child[0]         │ │
+│                         │ PrimitiveArray   │
+               Offsets    │                  │ │
+│                         └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+### Repetition Levels
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of 0 would imply a new list in the top-most repeated field, a value of 1 a new element within the top-most repeated field, a value of 2 a new element within the second top-most repeated field, and so on.
+
+Each repeated field also has a corresponding definition level; however, in this case, rather than indicating a null value, it indicates an empty array.
+
+
+```text
+a:
+  Data Page:
+    Repetition Levels: encode([0, 0, 0, 1])
+    Definition Levels: encode([3, 0, 2, 2])
+    Values: encode([1, 2])
+```
+
+```text
+┌────────────────────────────────────────┐
+│  ┌─────┐       ┌─────┐                 │
+│  │  1  │       │  3  │                 │
+│  ├─────┤       ├─────┤                 │
+│  │  0  │       │  0  │                 │
+│  ├─────┤       ├─────┤        ┌─────┐  │
+│  │  1  │       │  2  │        │  1  │  │
+│  ├─────┤       ├─────┤        ├─────┤  │
+│  │  0  │       │  2  │        │  2  │  │
+│  └─────┘       └─────┘        └─────┘  │
+│  Definition   Repetition       Data    │
+│    Levels       Levels                 │
+│    "a"                                 │
+└────────────────────────────────────────┘
+```
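+
+Putting the two level streams together, a reader can reassemble the original lists. The sketch below (illustrative only, and not the parquet crate's actual record-assembly code) applies the rules above to the three documents; note that the final element (`2`) is defined, so it carries the maximum definition level, `3`:
+
+```rust
+// Illustrative record assembly for the nullable list column "a"
+// (max definition level 3, max repetition level 1).
+fn assemble(
+    def: &[i16],
+    rep: &[i16],
+    mut values: impl Iterator<Item = i32>,
+) -> Vec<Option<Vec<Option<i32>>>> {
+    let mut rows: Vec<Option<Vec<Option<i32>>>> = Vec::new();
+    for (&d, &r) in def.iter().zip(rep) {
+        if r == 0 {
+            // repetition level 0 starts a new row:
+            // def 0 means "a" is null, def 1 means an empty list
+            rows.push(if d == 0 { None } else { Some(vec![]) });
+        }
+        if d >= 2 {
+            // definition level 2 is a null element, 3 is a defined element
+            let list = rows.last_mut().unwrap().as_mut().unwrap();
+            list.push((d == 3).then(|| values.next().unwrap()));
+        }
+    }
+    rows
+}
+
+fn main() {
+    // {"a": [1]}, {}, {"a": [null, 2]}
+    let rows = assemble(&[3, 0, 2, 3], &[0, 0, 0, 1], vec![1, 2].into_iter());
+    assert_eq!(rows, vec![Some(vec![Some(1)]), None, Some(vec![None, Some(2)])]);
+}
+```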
+
+
+
+
+## Next up: Arbitrary Nesting: Lists of Structs and Structs of Lists
+
+In our next blog post <!-- When published, add link here --> we will explain how Parquet and Arrow combine these concepts to support arbitrary nesting of potentially nullable data structures. It will also explain why definition levels are 16 bit integers when we have only shown values `0` and `1` so far.

Review Comment:
   ```suggestion
   In our next blog post <!-- When published, add link here --> we will explain how Parquet and Arrow combine these concepts to support arbitrary nesting of potentially nullable data structures.
   ```
   
   We've shown nested definition levels that aren't `0` and `1` :sweat_smile: 





[GitHub] [arrow-site] alamb merged pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
alamb merged PR #246:
URL: https://github.com/apache/arrow-site/pull/246




[GitHub] [arrow-site] github-actions[bot] commented on pull request #246: ARROW-17909: [WEBSITE] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #246:
URL: https://github.com/apache/arrow-site/pull/246#issuecomment-1264350210

   https://issues.apache.org/jira/browse/ARROW-17909




[GitHub] [arrow-site] tustvold commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990620831


##########
_posts/2022-10-07-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {       <-- b is always provided (not nullable)
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not nullable)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also
+  },               always provided (not nullable)
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in an Arrow `StructArray` with this schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
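+
+For reference, a rough sketch of declaring this schema with the Rust `arrow` crate (illustrative only; in recent arrow versions `DataType::Struct` takes a `Fields` collection rather than a `Vec<Field>`):
+
+```rust
+use arrow::datatypes::{DataType, Field, Schema};
+
+fn schema() -> Schema {
+    Schema::new(vec![
+        Field::new("a", DataType::Int32, true),
+        Field::new(
+            "b",
+            DataType::Struct(vec![
+                Field::new("b1", DataType::Int32, true),
+                Field::new("b2", DataType::Int32, false),
+            ]),
+            false,
+        ),
+        Field::new(
+            "c",
+            DataType::Struct(vec![Field::new("c1", DataType::Int32, false)]),
+            true,
+        ),
+        Field::new(
+            "d",
+            DataType::Struct(vec![
+                Field::new("d1", DataType::Int32, false),
+                Field::new("d2", DataType::Int32, true),
+            ]),
+            true,
+        ),
+    ])
+}
+
+fn main() {
+    println!("{:?}", schema());
+}
+```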
+
+
+Arrow represents each `StructArray` hierarchically using a parent-child relationship, with separate validity masks on each of the individual nullable arrays.
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
+│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
+   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
+│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
+   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
+│           │           │ │ │            │          ││                 │
+            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
+│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
+            │ Array     │                │Array     ││                 │ │
+│           └───────────┘ │ │            └──────────┘└─────────────────┘
+    "c"                         "d"                                      │
+│   StructArray           │ │   StructArray
+  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of `d.d2`, which contains two nullable levels, `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d.d2`:
+
+```json
+{
+  d: { }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the three JSON documents above, they could be stored in Parquet with this schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+The Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
+│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
+  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
+│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
+  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
+│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
+  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
+│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
+  │                    │ │   │                    │ │                  │ │
+│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
+  │    Levels          │ │   │    Levels          │ │   Levels         │ │
+│ │                    │   │ │                    │ │                  │
+  │  "c.1"             │ │   │  "d.1"             │ │  "d.d2"          │ │
+│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
+     "c"                 │      "d"                                      │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- "a" has a null and non-null elements
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+)
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion as a `ListArray`. A `ListArray` contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the `ListArray`.
+
+For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: []
+2: [child[2]]
+```
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
+
+
+### Parquet Repetition Levels
+
+The example above with 4 JSON documents can be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` means a new list in the top-most repeated list, a value of `1` means a new element within the top-most repeated list, a value of `2` means a new element within the second top-most repeated list, and so on.
+
+*Protip*: for the topmost level list, the number of zeros in the `repetition` levels must match the number of rows.
+
+Each repeated field also has a corresponding definition level; however, in this case, rather than indicating a null value, it indicates an empty array.
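+
+To make the level computation concrete, here is a small sketch (hypothetical code, not the parquet crate's writer) that shreds the four documents above into levels and values; it produces the same numbers as the diagram below:
+
+```rust
+// Compute (definition, repetition) levels for a nullable list of nullable
+// i32 values (the "a" column above). Max definition level is 3, max
+// repetition level is 1. Illustrative sketch only.
+fn shred(rows: &[Option<Vec<Option<i32>>>]) -> (Vec<i16>, Vec<i16>, Vec<i32>) {
+    let (mut def, mut rep, mut values) = (vec![], vec![], vec![]);
+    for row in rows {
+        match row {
+            None => { def.push(0); rep.push(0); }                           // "a" is null
+            Some(list) if list.is_empty() => { def.push(1); rep.push(0); }  // empty list
+            Some(list) => {
+                for (i, item) in list.iter().enumerate() {
+                    rep.push(if i == 0 { 0 } else { 1 }); // 0 starts a new row
+                    match item {
+                        None => def.push(2),              // element is null
+                        Some(v) => { def.push(3); values.push(*v); }
+                    }
+                }
+            }
+        }
+    }
+    (def, rep, values)
+}
+
+fn main() {
+    // {"a": [1]}, {}, {"a": []}, {"a": [null, 2]}
+    let rows = vec![Some(vec![Some(1)]), None, Some(vec![]), Some(vec![None, Some(2)])];
+    let (def, rep, values) = shred(&rows);
+    assert_eq!(def, vec![3, 0, 1, 2, 3]);
+    assert_eq!(rep, vec![0, 0, 0, 0, 1]);
+    assert_eq!(values, vec![1, 2]);
+}
+```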
+
+
+```text
+┌─────────────────────────────────────┐
+│  ┌─────┐      ┌─────┐               │
+│  │  3  │      │  0  │               │
+│  ├─────┤      ├─────┤               │
+│  │  0  │      │  0  │               │
+│  ├─────┤      ├─────┤      ┌─────┐  │
+│  │  1  │      │  0  │      │  1  │  │
+│  ├─────┤      ├─────┤      ├─────┤  │
+│  │  2  │      │  0  │      │  2  │  │
+│  ├─────┤      ├─────┤      └─────┘  │
+│  │  3  │      │  1  │               │
+│  └─────┘      └─────┘               │

Review Comment:
   Double checked with
   
   ```
   let mut list_builder = ListBuilder::new(Int32Builder::new());
   list_builder.values().append_value(1);
   list_builder.append(true);
   list_builder.append(false);
   list_builder.append(true);
   list_builder.values().append_null();
   list_builder.values().append_value(2);
   list_builder.append(true);
   let values = Arc::new(list_builder.finish()) as ArrayRef;
   
   let mut builder = LevelInfoBuilder::try_new(
       &Field::new("test", values.data_type().clone(), true),
       Default::default(),
   )
   .unwrap();
   builder.write(&values, 0..4);
   let levels = builder.finish();
   
   assert_eq!(levels.len(), 1);
   
   let list_level = levels.get(0).unwrap();
   
   let expected_level = LevelInfo {
       // 3 = defined value, 2 = null element, 1 = empty list, 0 = null list
       def_levels: Some(vec![3, 0, 1, 2, 3]),
       // a 0 starts a new row; the final 1 is the second element of [null, 2]
       rep_levels: Some(vec![0, 0, 0, 0, 1]),
       non_null_indices: vec![0, 2],
       max_def_level: 3,
       max_rep_level: 1,
   };
   assert_eq!(list_level, &expected_level);
   ```





[GitHub] [arrow-site] tustvold commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990621001


##########
_posts/2022-10-07-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {       <-- b is always provided (not nullable)
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not nullable)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also
+  },               always provided (not nullable)
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in an Arrow `StructArray` with this schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
+│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
+   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
+│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
+   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
+│           │           │ │ │            │          ││                 │
+            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
+│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
+            │ Array     │                │Array     ││                 │ │
+│           └───────────┘ │ │            └──────────┘└─────────────────┘
+    "c"                         "d"                                      │
+│   StructArray           │ │   StructArray
+  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
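A minimal sketch of building one of these struct columns with the arrow crate (the `From` impl and method names are assumptions about the crate version rather than something quoted from the post):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, Int32Array, StructArray};
use arrow::datatypes::{DataType, Field};

fn main() {
    // The non-nullable struct column "b": child "b1" is nullable, "b2" is not
    let b1: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(5)]));
    let b2: ArrayRef = Arc::new(Int32Array::from(vec![3, 4, 6]));

    let b = StructArray::from(vec![
        (Field::new("b1", DataType::Int32, true), b1),
        (Field::new("b2", DataType::Int32, false), b2),
    ]);

    // Each child carries its own values and, for "b1", its own validity mask
    assert_eq!(b.len(), 3);
    assert_eq!(b.num_columns(), 2);
    assert!(b.column(0).is_null(1)); // "b1" is null in the second record
}
```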
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d2`:
+
+```json
+{
+  d: { null }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
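In code, the mapping for this one column can be sketched as a simple match (illustrative only; the real writer derives this from the validity masks):

```rust
/// Definition level for the nullable chain d -> d2 (max definition level 2)
fn def_level(d: Option<Option<i32>>) -> i16 {
    match d {
        None => 0,          // "d" itself is null
        Some(None) => 1,    // "d" is present but "d2" is null
        Some(Some(_)) => 2, // "d.d2" is fully defined
    }
}

fn main() {
    assert_eq!(def_level(None), 0);
    assert_eq!(def_level(Some(None)), 1);
    assert_eq!(def_level(Some(Some(1))), 2);
}
```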
+
+
+Going back to the three JSON documents above, they could be stored in Parquet with this schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+The Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
+│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
+  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
+│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
+  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
+│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
+  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
+│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
+  │                    │ │   │                    │ │                  │ │
+│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
+  │    Levels          │ │   │    Levels          │ │   Levels         │ │
+│ │                    │   │ │                    │ │                  │
+  │  "c.1"             │ │   │  "d.1"             │ │  "d.d2"          │ │
+│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
+     "c"                 │      "d"                                      │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
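A hedged end-to-end sketch of writing such a nested column with the parquet crate and reading it back (the `ArrowWriter` and `ParquetRecordBatchReaderBuilder` usage and the scratch file path are assumptions about the crate version; the writer derives the definition levels shown above from the Arrow validity masks):

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array, StructArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Struct column "b" with a nullable child "b1" and a required child "b2"
    let b1: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(5)]));
    let b2: ArrayRef = Arc::new(Int32Array::from(vec![3, 4, 6]));
    let b: ArrayRef = Arc::new(StructArray::from(vec![
        (Field::new("b1", DataType::Int32, true), b1),
        (Field::new("b2", DataType::Int32, false), b2),
    ]));

    let schema = Arc::new(Schema::new(vec![Field::new("b", b.data_type().clone(), false)]));
    let batch = RecordBatch::try_new(schema.clone(), vec![b])?;

    // Write: definition levels are computed from the Arrow validity masks
    let mut writer = ArrowWriter::try_new(File::create("/tmp/nested.parquet")?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // Read back and check the nested structure survived the roundtrip
    let reader =
        ParquetRecordBatchReaderBuilder::try_new(File::open("/tmp/nested.parquet")?)?.build()?;
    for read_batch in reader {
        assert_eq!(read_batch?, batch);
    }
    Ok(())
}
```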
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- "a" has a null and non-null elements
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion as a `ListArray`. A `ListArray` contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the ListArray
+
+For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: [child[2]]
+2: []
+```
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
+
+
+### Parquet Repetition Levels
+
+The example above with 4 JSON documents can be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` means a new list in the top-most repeated list, a value of `1` means a new element within the top-most repeated list, a value of `2` means a new element within the second top-most repeated list, and so on.
+
+*Protip*: for the topmost level list, the number of zeros in the `repetition` levels must match the number of rows.

Review Comment:
   ```suggestion
   *Protip*: the number of zeros in the `repetition` levels must match the number of rows in the column, and the first level must be 0.
   ```
   The repetition levels don't belong to a list, they belong to the leaf column





[GitHub] [arrow-site] alamb commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r985219460


##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,345 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column: a column that contains one or more other columns.
+
+For example consider the following three JSON documents
+
+```json
+{              <-- First record
+  “a”: 1,      <-- the top level fields are a, b, c, and d
+  “b”: {
+    “b1”: 1,   <-- b1 and b2 are “nested” fields of “b”
+    “b2”: 3    <-- b2 is always provided (not null)
+   },
+ “d”: {
+   “d1”:  1    <-- d1 is a “nested” field of “d”
+  }
+}
+```
+```json
+{              <-- Second record
+  “a”: 2,
+  “b”: {
+    “b2”: 4    <-- note “b1” is NULL in this record
+  },
+  “c”: {       <-- note “c” was NULL in the first record
+    “c1”: 6        but when “c” is provided, c1 is also always provided
+  },
+  “d”: {
+    “d1”: 2,
+    “d2”: 1
+  }
+}
+```
+```json
+{              <-- Third record
+  “b”: {
+    “b1”: 5,
+    “b2”: 6
+  },
+  “c”: {
+    “c1”: 7
+  }
+}
+```
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: “a”, nullable: true, datatype: Int32)
+Field(name: “b”, nullable: false, datatype: Struct[
+  Field(name: “b1”, nullable: true, datatype: Int32),
+  Field(name: “b2”, nullable: false, datatype: Int32)
+])
+Field(name: “c”, nullable: true, datatype: Struct[
+  Field(name: “c1”, nullable: false, datatype: Int32)
+])
+Field(name: “d”, nullable: true, datatype: Struct[
+  Field(name: “d1”, nullable: false, datatype: Int32)
+  Field(name: “d2”, nullable: true, datatype: Int32)
+])
+```
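Spelled out with the arrow crate's datatypes, the same schema looks roughly like the sketch below (assuming a crate version where `DataType::Struct` takes a `Vec<Field>`):

```rust
use arrow::datatypes::{DataType, Field, Schema};

fn main() {
    // The nested schema above, built programmatically
    let schema = Schema::new(vec![
        Field::new("a", DataType::Int32, true),
        Field::new(
            "b",
            DataType::Struct(vec![
                Field::new("b1", DataType::Int32, true),
                Field::new("b2", DataType::Int32, false),
            ]),
            false,
        ),
        Field::new(
            "c",
            DataType::Struct(vec![Field::new("c1", DataType::Int32, false)]),
            true,
        ),
        Field::new(
            "d",
            DataType::Struct(vec![
                Field::new("d1", DataType::Int32, false),
                Field::new("d2", DataType::Int32, true),
            ]),
            true,
        ),
    ]);
    println!("{:#?}", schema);
}
```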
+
+Documents of this format could be stored in this Parquet schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
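The parquet crate can parse this message type directly; a small sketch (assuming the `parse_message_type` helper in `parquet::schema::parser`):

```rust
use parquet::schema::parser::parse_message_type;

fn main() {
    let message_type = "
        message schema {
          optional int32 a;
          required group b {
            optional int32 b1;
            required int32 b2;
          }
          optional group c {
            required int32 c1;
          }
          optional group d {
            required int32 d1;
            optional int32 d2;
          }
        }";
    let schema = parse_message_type(message_type).unwrap();
    // Four top-level fields: a, b, c and d
    assert_eq!(schema.get_fields().len(), 4);
}
```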
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  5  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐      ┌──────────────────────┐┌──────────────────┐ │
+│ │ ┌─────┐    ┌─────┐   │  │ │ │  ┌─────┐   ┌─────┐   ││ ┌─────┐   ┌─────┐│
+  │ │  0  │    │ ??  │   │      │  │  1  │   │  1  │   ││ │  0  │   │ ??  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  6  │   │      │  │  1  │   │  2  │   ││ │  1  │   │  1  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  7  │   │      │  │  0  │   │ ??  │   ││ │ ??  │   │ ??  ││ │
+│ │ └─────┘    └─────┘   │  │ │ │  └─────┘   └─────┘   ││ └─────┘   └─────┘│
+  │ Validity    Values   │      │  Validity   Values   ││ Validity   Values│ │
+│ │                      │  │ │ │                      ││                  │
+  │            "c.c1"    │      │            "d.d1"    ││ "d.d2"           │ │
+│ │            Primitive │  │ │ │            Primitive ││ PrimitiveArray   │
+  │            Array     │      │            Array     ││                  │ │
+│ └──────────────────────┘  │ │ └──────────────────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray

Review Comment:
   done -- can you give it a look?





[GitHub] [arrow-site] tustvold commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990271820


##########
_posts/2022-10-07-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {       <-- b is always provided (not nullable)
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not nullable)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also
+  },               always provided (not nullable)
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in an Arrow `StructArray` with this schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
+│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
+   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
+│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
+   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
+│           │           │ │ │            │          ││                 │
+            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
+│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
+            │ Array     │                │Array     ││                 │ │
+│           └───────────┘ │ │            └──────────┘└─────────────────┘
+    "c"                         "d"                                      │
+│   StructArray           │ │   StructArray
+  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d2`:
+
+```json
+{
+  d: { null }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the three JSON documents above, they could be stored in Parquet with this schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+The Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
+│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
+  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
+│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
+  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
+│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
+  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
+│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
+  │                    │ │   │                    │ │                  │ │
+│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
+  │    Levels          │ │   │    Levels          │ │   Levels         │ │
+│ │                    │   │ │                    │ │                  │
+  │  "c.1"             │ │   │  "d.1"             │ │  "d.d2"          │ │
+│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
+     "c"                 │      "d"                                      │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- "a" has a null and non-null elements
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion as a `ListArray`. A `ListArray` contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the ListArray
+
+For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: [child[2]]
+2: []
+```
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
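An alternative to the builder shown earlier in the thread is the `from_iter_primitive` constructor; a sketch assuming that helper exists on `ListArray` in this crate version:

```rust
use arrow::array::{Array, ListArray};
use arrow::datatypes::Int32Type;

fn main() {
    // Rows from the example: [1], null, [], [null, 2]
    let rows = vec![
        Some(vec![Some(1)]),
        None,
        Some(vec![]),
        Some(vec![None, Some(2)]),
    ];
    let list = ListArray::from_iter_primitive::<Int32Type, _, _>(rows);

    assert_eq!(list.len(), 4);
    assert_eq!(list.value_offsets(), &[0, 1, 1, 1, 3]);
    assert!(list.is_null(1)); // the second row is null, not empty
}
```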
+
+
+### Parquet Repetition Levels
+
+The example above with 4 JSON documents can be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` means a new list in the top-most repeated list, a value of `1` means a new element within the top-most repeated list, a value of `2` means a new element within the second top-most repeated list, and so on.
+
+*Protip*: for the topmost level list, the number of zeros in the `repetition` levels must match the number of rows.
+
+Each repeated field also has a corresponding definition level; however, in this case, rather than indicating a null value, it indicates an empty array.
+
+
+```text
+┌─────────────────────────────────────┐
+│  ┌─────┐      ┌─────┐               │
+│  │  3  │      │  0  │               │
+│  ├─────┤      ├─────┤               │
+│  │  0  │      │  0  │               │
+│  ├─────┤      ├─────┤      ┌─────┐  │
+│  │  1  │      │  0  │      │  1  │  │
+│  ├─────┤      ├─────┤      ├─────┤  │
+│  │  2  │      │  0  │      │  2  │  │
+│  ├─────┤      ├─────┤      └─────┘  │
+│  │  3  │      │  1  │               │
+│  └─────┘      └─────┘               │

Review Comment:
   But there are 5 levels, as one of the arrays has two elements? It looks correct to me...





[GitHub] [arrow-site] alamb commented on pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #246:
URL: https://github.com/apache/arrow-site/pull/246#issuecomment-1271730881

   I plan to publish this today / tomorrow if there are no final thoughts
   




[GitHub] [arrow-site] alamb commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990623640


##########
_posts/2022-10-07-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {       <-- b is always provided (not nullable)
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not nullable)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also
+  },               always provided (not nullable)
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in an Arrow `StructArray` with this schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
+│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
+   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
+│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
+   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
+│           │           │ │ │            │          ││                 │
+            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
+│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
+            │ Array     │                │Array     ││                 │ │
+│           └───────────┘ │ │            └──────────┘└─────────────────┘
+    "c"                         "d"                                      │
+│   StructArray           │ │   StructArray
+  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d2`:
+
+```json
+{
+  d: { null }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the three JSON documents above, they could be stored in Parquet with this schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+The Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
+│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
+  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
+│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
+  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
+│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
+  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
+│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
+  │                    │ │   │                    │ │                  │ │
+│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
+  │    Levels          │ │   │    Levels          │ │   Levels         │ │
+│ │                    │   │ │                    │ │                  │
+  │  "c.1"             │ │   │  "d.1"             │ │  "d.d2"          │ │
+│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
+     "c"                 │      "d"                                      │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- "a" has a null and non-null elements
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion as a `ListArray`. A `ListArray` contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the ListArray
+
+For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: [child[2]]
+2: []
+```
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
+
+
+### Parquet Repetition Levels
+
+The example above with 4 JSON documents can be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
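To see how the Arrow list schema maps onto this three-level Parquet structure, the conversion can also be done programmatically; a sketch assuming the `arrow_to_parquet_schema` helper exported from `parquet::arrow`:

```rust
use arrow::datatypes::{DataType, Field, Schema};
use parquet::arrow::arrow_to_parquet_schema;

fn main() {
    // Nullable list column "a" of nullable Int32 elements
    let list_field = Field::new(
        "a",
        DataType::List(Box::new(Field::new("element", DataType::Int32, true))),
        true,
    );
    let schema = Schema::new(vec![list_field]);

    // Converts to the repeated-group (LIST) encoding shown above
    let parquet_schema = arrow_to_parquet_schema(&schema).unwrap();
    assert_eq!(parquet_schema.num_columns(), 1); // one leaf column: a.list.element
    println!("{:#?}", parquet_schema.root_schema());
}
```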
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` means a new list in the top-most repeated list, a value of `1` means a new element within the top-most repeated list, a value of `2` means a new element within the second top-most repeated list, and so on.
+
+*Protip*: for the topmost level list, the number of zeros in the `repetition` levels must match the number of rows.
+
+Each repeated field also has a corresponding definition level; however, in this case, rather than indicating a null value, it indicates an empty array.
+
+
+```text
+┌─────────────────────────────────────┐
+│  ┌─────┐      ┌─────┐               │
+│  │  3  │      │  0  │               │
+│  ├─────┤      ├─────┤               │
+│  │  0  │      │  0  │               │
+│  ├─────┤      ├─────┤      ┌─────┐  │
+│  │  1  │      │  0  │      │  1  │  │
+│  ├─────┤      ├─────┤      ├─────┤  │
+│  │  2  │      │  0  │      │  2  │  │
+│  ├─────┤      ├─────┤      └─────┘  │
+│  │  3  │      │  1  │               │
+│  └─────┘      └─────┘               │

Review Comment:
   Thank you -- I was losing sleep over this particular example





[GitHub] [arrow-site] tustvold commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990620831


##########
_posts/2022-10-07-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {       <-- b is always provided (not nullable)
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not nullable)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also
+  },               always provided (not nullable)
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in an Arrow `StructArray` with this schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
+│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
+   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
+│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
+   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
+│           │           │ │ │            │          ││                 │
+            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
+│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
+            │ Array     │                │Array     ││                 │ │
+│           └───────────┘ │ │            └──────────┘└─────────────────┘
+    "c"                         "d"                                      │
+│   StructArray           │ │   StructArray
+  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
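+
+To make this concrete, here is a rough sketch of how the `"b"` column could be built using the Rust `arrow` crate; the `StructArray::from` constructor shown here has changed slightly between versions, so treat it as illustrative rather than exact.
+
+```rust
+use std::sync::Arc;
+use arrow::array::{Array, ArrayRef, Int32Array, StructArray};
+use arrow::datatypes::{DataType, Field};
+
+fn main() {
+    // "b.b1" is nullable: 1, NULL, 5 (a validity mask plus values)
+    let b1: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(5)]));
+    // "b.b2" is not nullable: 3, 4, 6 (values only, no validity mask)
+    let b2: ArrayRef = Arc::new(Int32Array::from(vec![3, 4, 6]));
+
+    // The parent StructArray simply stores its children alongside their fields
+    let b = StructArray::from(vec![
+        (Field::new("b1", DataType::Int32, true), b1),
+        (Field::new("b2", DataType::Int32, false), b2),
+    ]);
+    assert_eq!(b.len(), 3);
+}
+```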
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of `d.d2`, which contains two nullable levels, `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d2` (`d` itself is defined):
+
+```json
+{
+  d: { }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+Applying this to the three JSON documents above, column `d.d2` would therefore be encoded with definition levels `1` (first record: `d` is defined but `d2` is null), `2` (second record: `d.d2` has a value), and `0` (third record: `d` is null).
+
+
+Going back to the three JSON documents above, they could be stored in Parquet with this schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+The Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
+│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
+  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
+│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
+  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
+│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
+  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
+│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
+  │                    │ │   │                    │ │                  │ │
+│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
+  │    Levels          │ │   │    Levels          │ │   Levels         │ │
+│ │                    │   │ │                    │ │                  │
+  │  "c.1"             │ │   │  "d.1"             │ │  "d.d2"          │ │
+│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
+     "c"                 │      "d"                                      │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
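+
+These levels do not need to be computed by hand when converting from Arrow: a writer such as the `parquet` crate's `ArrowWriter` derives them from the validity masks while encoding. The following is a rough sketch of writing columns `"a"` and `"b"` from the example; the output file name and exact signatures are illustrative.
+
+```rust
+use std::fs::File;
+use std::sync::Arc;
+use arrow::array::{ArrayRef, Int32Array, StructArray};
+use arrow::datatypes::{DataType, Field};
+use arrow::record_batch::RecordBatch;
+use parquet::arrow::ArrowWriter;
+
+fn main() -> Result<(), Box<dyn std::error::Error>> {
+    // Column "a": 1, 2, NULL
+    let a: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), Some(2), None]));
+
+    // Column "b": a non-nullable struct with children "b1" (nullable) and "b2"
+    let b: ArrayRef = Arc::new(StructArray::from(vec![
+        (
+            Field::new("b1", DataType::Int32, true),
+            Arc::new(Int32Array::from(vec![Some(1), None, Some(5)])) as ArrayRef,
+        ),
+        (
+            Field::new("b2", DataType::Int32, false),
+            Arc::new(Int32Array::from(vec![3, 4, 6])) as ArrayRef,
+        ),
+    ]));
+
+    let batch = RecordBatch::try_from_iter(vec![("a", a), ("b", b)])?;
+
+    // The writer computes the definition levels shown above while encoding
+    let file = File::create("example.parquet")?;
+    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
+    writer.write(&batch)?;
+    writer.close()?;
+    Ok(())
+}
+```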
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- "a" has a null and non-null elements
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion as a `ListArray`. A `ListArray` contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the `ListArray`.
+
+For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: []
+2: [child[2]]
+```
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
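+
+As a concrete sketch, the four documents above could be built with the `arrow` crate's `ListBuilder`; builder APIs differ slightly between versions, but the resulting offsets and validity match the diagram above.
+
+```rust
+use arrow::array::{Array, Int32Builder, ListBuilder};
+
+fn main() {
+    // Build the ListArray for the four documents: [1], null, [], [null, 2]
+    let mut builder = ListBuilder::new(Int32Builder::new());
+    builder.values().append_value(1);
+    builder.append(true);           // [1]
+    builder.append(false);          // null
+    builder.append(true);           // []
+    builder.values().append_null();
+    builder.values().append_value(2);
+    builder.append(true);           // [null, 2]
+    let list = builder.finish();
+
+    // Offsets and validity match the ListArray diagram above
+    assert_eq!(list.value_offsets(), &[0, 1, 1, 1, 3]);
+    assert!(list.is_null(1));
+    assert_eq!(list.len(), 4);
+}
+```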
+
+
+### Parquet Repetition Levels
+
+The example above with 4 JSON documents can be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` means a new list in the top-most repeated list, a value of `1` means a new element within the top-most repeated list, a value of `2` means a new element within the second top-most repeated list, and so on.
+
+*Protip*: for the top-most list, the number of zeros in the repetition levels must match the number of rows.
+
+Each repeated field also has a corresponding definition level; however, in this case, rather than indicating a null value, it indicates an empty array.
+
+The example above would therefore be encoded as:
+
+```text
+┌─────────────────────────────────────┐
+│  ┌─────┐      ┌─────┐               │
+│  │  3  │      │  0  │               │
+│  ├─────┤      ├─────┤               │
+│  │  0  │      │  0  │               │
+│  ├─────┤      ├─────┤      ┌─────┐  │
+│  │  1  │      │  0  │      │  1  │  │
+│  ├─────┤      ├─────┤      ├─────┤  │
+│  │  2  │      │  0  │      │  2  │  │
+│  ├─────┤      ├─────┤      └─────┘  │
+│  │  3  │      │  1  │               │
+│  └─────┘      └─────┘               │

Review Comment:
   Double checked with
   
   ```
    // Build the ListArray for the four example documents: [1], null, [], [null, 2]
    let mut list_builder = ListBuilder::new(Int32Builder::new());
    list_builder.values().append_value(1);
    list_builder.append(true);
    list_builder.append(false);
    list_builder.append(true);
    list_builder.values().append_null();
    list_builder.values().append_value(2);
    list_builder.append(true);
    let values = Arc::new(list_builder.finish()) as ArrayRef;

    // Compute the Parquet definition / repetition levels for the array
    let mut builder = LevelInfoBuilder::try_new(
        &Field::new("test", values.data_type().clone(), true),
        Default::default(),
    )
    .unwrap();
    builder.write(&values, 0..4);
    let levels = builder.finish();

    assert_eq!(levels.len(), 1);

    let list_level = levels.get(0).unwrap();

    let expected_level = LevelInfo {
        def_levels: Some(vec![3, 0, 1, 2, 3]),
        rep_levels: Some(vec![0, 0, 0, 0, 1]),
        non_null_indices: vec![0, 2],
        max_def_level: 3,
        max_rep_level: 1,
    };
    assert_eq!(list_level, &expected_level);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] tustvold commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990621079


##########
_posts/2022-10-07-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {       <-- b is always provided (not nullable)
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not nullable)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also
+  },               always provided (not nullable)
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in an Arrow `StructArray` with this schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
+│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
+   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
+│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
+   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
+│           │           │ │ │            │          ││                 │
+            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
+│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
+            │ Array     │                │Array     ││                 │ │
+│           └───────────┘ │ │            └──────────┘└─────────────────┘
+    "c"                         "d"                                      │
+│   StructArray           │ │   StructArray
+  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element, is the depth in the schema at which it is fully defined.
+
+For example consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d`
+
+```json
+{
+  d: { null }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the three JSON documents above, they could be stored in Parquet with this schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+The Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
+│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
+  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
+│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
+  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
+│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
+  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
+│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
+  │                    │ │   │                    │ │                  │ │
+│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
+  │    Levels          │ │   │    Levels          │ │   Levels         │ │
+│ │                    │   │ │                    │ │                  │
+  │  "c.1"             │ │   │  "d.1"             │ │  "d.d2"          │ │
+│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
+     "c"                 │      "d"                                      │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- "a" has a null and non-null elements
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+)
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion as a `ListArray`. A `ListArray` contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the ListArray
+
+For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: []
+2: [child[2]]
+```
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
+
+
+### Parquet Repetition Levels
+
+The example above with 4 JSON documents can be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` means a new list in the top-most repeated list, a value of `1` means a new element within the top-most repeated list, a value of `2` means a new element within the second top-most repeated list, and so on.
+
+*Protip*: for the topmost level list, the number of zeros in the `repetition` levels must match the number of rows.
+
+Each repeated field also has a corresponding definition level, however, in this case rather than indicating a null value, they indicate an empty array.
+

Review Comment:
   ```suggestion
   
   The example above would therefore be encoded as
   
   ```
   Or something to introduce the diagram



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] tustvold commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r988661056


##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists"
+date: "2022-10-01 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. The fist post <!-- todo add link when published --> covers the basics of data storage and validity encoding, and this post covers `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)

Review Comment:
   ```suggestion
       "b2": 3    <-- b2 is always provided (not nullable)
   ```



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists"
+date: "2022-10-01 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. The fist post <!-- todo add link when published --> covers the basics of data storage and validity encoding, and this post covers `Struct` and `List` types.

Review Comment:
   ```suggestion
   This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
   ```
   
   I don't think the qualifications are necessary given the following paragraph, and they make it harder to read



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists"
+date: "2022-10-01 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. The fist post <!-- todo add link when published --> covers the basics of data storage and validity encoding, and this post covers `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also always provided
+  },
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                  ┌──────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐   │┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │   ││  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │   ││  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │   ││ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘   │└─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity  │ Values   ││ Validity   Values│ │
+│            │           │  │ │             │          ││                  │
+             │ "c.c1"    │                  │"d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │             │Primitive ││ PrimitiveArray   │
+             │ Array     │                  │Array     ││                  │ │
+│            └───────────┘  │ │             └──────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element, is the depth in the schema at which it is fully defined.

Review Comment:
   ```suggestion
   Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element, is the depth in the schema at which it is fully defined.
   ```
   
   Given we haven't shown a group yet :sweat_smile: 



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists"
+date: "2022-10-01 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. The fist post <!-- todo add link when published --> covers the basics of data storage and validity encoding, and this post covers `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"

Review Comment:
   ```suggestion
     "b": {       <-- b is always provided (not nullable)
       "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
   ```



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists"
+date: "2022-10-01 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. The fist post <!-- todo add link when published --> covers the basics of data storage and validity encoding, and this post covers `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also always provided

Review Comment:
   ```suggestion
       "c1": 6        but when "c" is provided, c1 is also always provided (not nullable)
   ```



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists"
+date: "2022-10-01 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. The fist post <!-- todo add link when published --> covers the basics of data storage and validity encoding, and this post covers `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also always provided
+  },
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                  ┌──────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐   │┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │   ││  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │   ││  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │   ││ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘   │└─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity  │ Values   ││ Validity   Values│ │
+│            │           │  │ │             │          ││                  │
+             │ "c.c1"    │                  │"d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │             │Primitive ││ PrimitiveArray   │
+             │ Array     │                  │Array     ││                  │ │
+│            └───────────┘  │ │             └──────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element, is the depth in the schema at which it is fully defined.
+
+For example consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d.d2`
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the JSON documents above, this format could be stored in this Parquet schema:
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Thus the Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐ │   ┌──────────────────────┐ ┌──────────────────────┐  │
+│ │  ┌─────┐    ┌─────┐  │   │ │  ┌─────┐    ┌─────┐  │ │ ┌─────┐     ┌─────┐  │
+  │  │  0  │    │  6  │  │ │   │  │  1  │    │  1  │  │ │ │  1  │     │  1  │  │  │
+│ │  ├─────┤    ├─────┤  │   │ │  ├─────┤    ├─────┤  │ │ ├─────┤     └─────┘  │
+  │  │  1  │    │  7  │  │ │   │  │  1  │    │  2  │  │ │ │  2  │              │  │
+│ │  ├─────┤    └─────┘  │   │ │  ├─────┤    └─────┘  │ │ ├─────┤              │
+  │  │  1  │             │ │   │  │  0  │             │ │ │  0  │              │  │
+│ │  └─────┘             │   │ │  └─────┘             │ │ └─────┘              │
+  │                      │ │   │                      │ │                      │  │
+│ │  Definition   Data   │   │ │  Definition   Data   │ │ Definition   Data    │
+  │    Levels            │ │   │    Levels            │ │   Levels             │  │
+│ │                      │   │ │                      │ │                      │
+  │  "c.1"               │ │   │  "d.1"               │ │  "d.d2"              │  │
+│ └──────────────────────┘   │ └──────────────────────┘ └──────────────────────┘
+     "c"                   │      "d"                                             │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which containing a variable number of other values. For example,
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- list elements can themselves be null
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+)
+```
+
+Documents of this format could be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion with a list of monotonically increasing integers called *offsets* in a `ListArray`, and stores all the values that appear in the lists in a single child array. Each consecutive pair of elements in this offset array identifies a slice of the child array for that array index

Review Comment:
   ```suggestion
   As before, Arrow chooses to represent this in a hierarchical fashion as a ListArray. This contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the ListArray
   ```



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists"
+date: "2022-10-01 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. The fist post <!-- todo add link when published --> covers the basics of data storage and validity encoding, and this post covers `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also always provided
+  },
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c", nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d", nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32),
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent-child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                  ┌──────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐   │┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │   ││  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │   ││  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │   ││ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘   │└─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity  │ Values   ││ Validity   Values│ │
+│            │           │  │ │             │          ││                  │
+             │ "c.c1"    │                  │"d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │             │Primitive ││ PrimitiveArray   │
+             │ Array     │                  │Array     ││                  │ │
+│            └───────────┘  │ │             └──────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
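+
+To make the parent-child layout concrete, the following is a minimal sketch in plain Rust (deliberately not using the arrow crate, and with made-up type names) of how the `d` column above can be read: a nested value is only present when the struct's validity bit *and* the child's validity bit are both set.
+
+```rust
+// Sketch: a struct column "d" with a nullable child "d2", mirroring the
+// separate validity masks shown in the diagram above.
+struct DColumn {
+    validity: Vec<bool>,    // is record i's "d" non-null?
+    d2_validity: Vec<bool>, // is record i's "d.d2" non-null?
+    d2: Vec<i32>,           // slot content is arbitrary (??) when the mask is false
+}
+
+impl DColumn {
+    // "d.d2" is only meaningful when both "d" and "d2" are valid.
+    fn d2_at(&self, i: usize) -> Option<i32> {
+        if self.validity[i] && self.d2_validity[i] {
+            Some(self.d2[i])
+        } else {
+            None
+        }
+    }
+}
+
+fn main() {
+    // The three example records: d = {d1: 1}, d = {d1: 2, d2: 1}, d is null.
+    let d = DColumn {
+        validity: vec![true, true, false],
+        d2_validity: vec![false, true, false],
+        d2: vec![0, 1, 0], // only index 1 is meaningful
+    };
+    assert_eq!(d.d2_at(0), None);
+    assert_eq!(d.d2_at(1), Some(1));
+    assert_eq!(d.d2_at(2), None);
+}
+```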
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d.d2`
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
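+
+A small sketch in plain Rust (names are illustrative, not the arrow-rs API) shows how these levels follow from the nesting of the two nullable fields:
+
+```rust
+// Sketch: the definition level of "d.d2" for one record.
+// Both "d" and "d2" are nullable, so the level ranges over 0..=2.
+fn d2_definition_level(d: &Option<Option<i32>>) -> u8 {
+    match d {
+        None => 0,          // "d" itself is null
+        Some(None) => 1,    // "d" is present but "d2" is null
+        Some(Some(_)) => 2, // "d.d2" is fully defined
+    }
+}
+
+fn main() {
+    assert_eq!(d2_definition_level(&None), 0);
+    assert_eq!(d2_definition_level(&Some(None)), 1);
+    assert_eq!(d2_definition_level(&Some(Some(1))), 2);
+}
+```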
+
+
+Going back to the JSON documents above, this format could be stored in this Parquet schema:
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Thus the Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐ │   ┌──────────────────────┐ ┌──────────────────────┐  │
+│ │  ┌─────┐    ┌─────┐  │   │ │  ┌─────┐    ┌─────┐  │ │ ┌─────┐     ┌─────┐  │
+  │  │  0  │    │  6  │  │ │   │  │  1  │    │  1  │  │ │ │  1  │     │  1  │  │  │
+│ │  ├─────┤    ├─────┤  │   │ │  ├─────┤    ├─────┤  │ │ ├─────┤     └─────┘  │
+  │  │  1  │    │  7  │  │ │   │  │  1  │    │  2  │  │ │ │  2  │              │  │
+│ │  ├─────┤    └─────┘  │   │ │  ├─────┤    └─────┘  │ │ ├─────┤              │
+  │  │  1  │             │ │   │  │  0  │             │ │ │  0  │              │  │
+│ │  └─────┘             │   │ │  └─────┘             │ │ └─────┘              │
+  │                      │ │   │                      │ │                      │  │
+│ │  Definition   Data   │   │ │  Definition   Data   │ │ Definition   Data    │
+  │    Levels            │ │   │    Levels            │ │   Levels             │  │
+│ │                      │   │ │                      │ │                      │
+  │  "c.c1"              │ │   │  "d.d1"              │ │  "d.d2"              │  │
+│ └──────────────────────┘   │ └──────────────────────┘ └──────────────────────┘
+     "c"                   │      "d"                                             │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
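+
+Reading in the other direction, a decoder reconstructs the nested nulls from the definition levels plus the densely packed values. A minimal sketch in plain Rust (again not the arrow-rs reader, and using illustrative names) for the `d.d2` column:
+
+```rust
+// Sketch: rebuild Option<Option<i32>> values for "d.d2" from its definition
+// levels and its densely packed data (only fully defined values are stored).
+fn decode_d2(def_levels: &[u8], data: &[i32]) -> Vec<Option<Option<i32>>> {
+    let mut values = data.iter();
+    def_levels
+        .iter()
+        .map(|level| match level {
+            0 => None,                                // "d" is null
+            1 => Some(None),                          // "d" present, "d2" null
+            _ => Some(Some(*values.next().unwrap())), // consume one stored value
+        })
+        .collect()
+}
+
+fn main() {
+    // Definition levels and data for "d.d2" across the three example records.
+    assert_eq!(
+        decode_d2(&[1, 2, 0], &[1]),
+        vec![Some(None), Some(Some(1)), None]
+    );
+}
+```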
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example,
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- list elements can themselves be null
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+Documents of this format could be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion with a list of monotonically increasing integers called *offsets* in a `ListArray`, and stores all the values that appear in the lists in a single child array. Each consecutive pair of elements in this offset array identifies a slice of the child array for that array index
+
+For example, the list of offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and is therefore a ListArray of length 3 with the following values:

Review Comment:
   ```suggestion
   For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and is therefore a ListArray of length 3 with the following values:
   ```



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists"
+date: "2022-10-01 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. The first post <!-- todo add link when published --> covers the basics of data storage and validity encoding, and this post covers `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also always provided
+  },
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c", nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d", nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32),
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
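+
+For readers following along in code, a schema along these lines can be declared with the Rust arrow crate roughly as follows. This is a sketch: the exact constructors (in particular how struct fields are passed to `DataType::Struct`) have changed between arrow versions, so treat it as illustrative rather than definitive.
+
+```rust
+use arrow::datatypes::{DataType, Field, Schema};
+
+// Sketch of the schema above; exact signatures vary across arrow versions.
+fn example_schema() -> Schema {
+    Schema::new(vec![
+        Field::new("a", DataType::Int32, true),
+        Field::new(
+            "b",
+            DataType::Struct(
+                vec![
+                    Field::new("b1", DataType::Int32, true),
+                    Field::new("b2", DataType::Int32, false),
+                ]
+                .into(),
+            ),
+            false,
+        ),
+        Field::new(
+            "c",
+            DataType::Struct(vec![Field::new("c1", DataType::Int32, false)].into()),
+            true,
+        ),
+        Field::new(
+            "d",
+            DataType::Struct(
+                vec![
+                    Field::new("d1", DataType::Int32, false),
+                    Field::new("d2", DataType::Int32, true),
+                ]
+                .into(),
+            ),
+            true,
+        ),
+    ])
+}
+
+fn main() {
+    println!("{:#?}", example_schema());
+}
+```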
+
+
+Arrow represents each `StructArray` hierarchically using a parent-child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                  ┌──────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐   │┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │   ││  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │   ││  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │   ││ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘   │└─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity  │ Values   ││ Validity   Values│ │
+│            │           │  │ │             │          ││                  │
+             │ "c.c1"    │                  │"d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │             │Primitive ││ PrimitiveArray   │
+             │ Array     │                  │Array     ││                  │ │
+│            └───────────┘  │ │             └──────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d.d2`

Review Comment:
   ```suggestion
   A definition level of `1` would imply a null at the level of `d`
   ```



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists"
+date: "2022-10-01 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. The first post <!-- todo add link when published --> covers the basics of data storage and validity encoding, and this post covers `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also always provided
+  },
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c", nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d", nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32),
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent-child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                  ┌──────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐   │┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │   ││  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │   ││  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │   ││ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘   │└─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity  │ Values   ││ Validity   Values│ │
+│            │           │  │ │             │          ││                  │
+             │ "c.c1"    │                  │"d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │             │Primitive ││ PrimitiveArray   │
+             │ Array     │                  │Array     ││                  │ │
+│            └───────────┘  │ │             └──────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d.d2`
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the JSON documents above, this format could be stored in this Parquet schema:
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
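+
+As a quick cross-check, the maximum definition level of each leaf column is simply the number of `optional` fields on its path: `a`, `b.b1`, `c.c1` and `d.d1` max out at 1, `d.d2` at 2, and `b.b2` at 0 (both it and its parent are `required`), which is why `b.b2` needs no definition levels at all and appears with only a data column in the encoding below.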
+
+Thus the Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐ │   ┌──────────────────────┐ ┌──────────────────────┐  │
+│ │  ┌─────┐    ┌─────┐  │   │ │  ┌─────┐    ┌─────┐  │ │ ┌─────┐     ┌─────┐  │
+  │  │  0  │    │  6  │  │ │   │  │  1  │    │  1  │  │ │ │  1  │     │  1  │  │  │
+│ │  ├─────┤    ├─────┤  │   │ │  ├─────┤    ├─────┤  │ │ ├─────┤     └─────┘  │
+  │  │  1  │    │  7  │  │ │   │  │  1  │    │  2  │  │ │ │  2  │              │  │
+│ │  ├─────┤    └─────┘  │   │ │  ├─────┤    └─────┘  │ │ ├─────┤              │
+  │  │  1  │             │ │   │  │  0  │             │ │ │  0  │              │  │
+│ │  └─────┘             │   │ │  └─────┘             │ │ └─────┘              │
+  │                      │ │   │                      │ │                      │  │
+│ │  Definition   Data   │   │ │  Definition   Data   │ │ Definition   Data    │
+  │    Levels            │ │   │    Levels            │ │   Levels             │  │
+│ │                      │   │ │                      │ │                      │
+  │  "c.c1"              │ │   │  "d.d1"              │ │  "d.d2"              │  │
+│ └──────────────────────┘   │ └──────────────────────┘ └──────────────────────┘
+     "c"                   │      "d"                                             │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example,
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- list elements can themselves be null
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+Documents of this format could be stored in this Parquet schema

Review Comment:
   I would move this down to the parquet section, same as was done for the section above



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists"
+date: "2022-10-01 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. The first post <!-- todo add link when published --> covers the basics of data storage and validity encoding, and this post covers `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also always provided
+  },
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c", nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d", nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32),
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent-child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                  ┌──────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐   │┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │   ││  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │   ││  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │   ││ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘   │└─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity  │ Values   ││ Validity   Values│ │
+│            │           │  │ │             │          ││                  │
+             │ "c.c1"    │                  │"d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │             │Primitive ││ PrimitiveArray   │
+             │ Array     │                  │Array     ││                  │ │
+│            └───────────┘  │ │             └──────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d.d2`
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the JSON documents above, this format could be stored in this Parquet schema:
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Thus the Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐ │   ┌──────────────────────┐ ┌──────────────────────┐  │
+│ │  ┌─────┐    ┌─────┐  │   │ │  ┌─────┐    ┌─────┐  │ │ ┌─────┐     ┌─────┐  │
+  │  │  0  │    │  6  │  │ │   │  │  1  │    │  1  │  │ │ │  1  │     │  1  │  │  │
+│ │  ├─────┤    ├─────┤  │   │ │  ├─────┤    ├─────┤  │ │ ├─────┤     └─────┘  │
+  │  │  1  │    │  7  │  │ │   │  │  1  │    │  2  │  │ │ │  2  │              │  │
+│ │  ├─────┤    └─────┘  │   │ │  ├─────┤    └─────┘  │ │ ├─────┤              │
+  │  │  1  │             │ │   │  │  0  │             │ │ │  0  │              │  │
+│ │  └─────┘             │   │ │  └─────┘             │ │ └─────┘              │
+  │                      │ │   │                      │ │                      │  │
+│ │  Definition   Data   │   │ │  Definition   Data   │ │ Definition   Data    │
+  │    Levels            │ │   │    Levels            │ │   Levels             │  │
+│ │                      │   │ │                      │ │                      │
+  │  "c.c1"              │ │   │  "d.d1"              │ │  "d.d2"              │  │
+│ └──────────────────────┘   │ └──────────────────────┘ └──────────────────────┘
+     "c"                   │      "d"                                             │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
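+
+Reading the `d.d2` column of this encoding back out confirms the rule: definition level 1 (first record) means `d` is present but `d2` is null, level 2 (second record) means `d.d2` is fully defined and consumes the single stored value `1`, and level 0 (third record) means `d` itself is null.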
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example,
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- list elements can themselves be null
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+Documents of this format could be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion with a list of monotonically increasing integers called *offsets* in a `ListArray`, and stores all the values that appear in the lists in a single child array. Each consecutive pair of elements in this offset array identifies a slice of the child array for that array index
+
+For example, the list of offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and is therefore a ListArray of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: [child[2]]
+2: []
+```
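+
+A minimal sketch in plain Rust (not the arrow crate) of that decoding rule, using arbitrary child values:
+
+```rust
+// Sketch: interpret a ListArray-style offsets buffer against its child values.
+fn main() {
+    let offsets: Vec<usize> = vec![0, 2, 3, 3];
+    let child = vec![10, 20, 30]; // arbitrary child values
+
+    // Each consecutive pair of offsets describes one list.
+    for (i, pair) in offsets.windows(2).enumerate() {
+        println!("{i}: {:?}", &child[pair[0]..pair[1]]);
+    }
+    // 0: [10, 20]
+    // 1: [30]
+    // 2: []
+}
+```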
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
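+
+To tie the diagram to the four documents, here is a plain-Rust sketch (again not the arrow crate API) that derives the same validity, offsets and child buffers:
+
+```rust
+// Sketch: build ListArray-style buffers for field "a" from the four documents.
+fn main() {
+    // Outer None = the list itself is null; inner None = a null element.
+    let docs: Vec<Option<Vec<Option<i32>>>> = vec![
+        Some(vec![Some(1)]),       // {"a": [1]}
+        None,                      // {}  ("a" is null)
+        Some(vec![]),              // {"a": []}
+        Some(vec![None, Some(2)]), // {"a": [null, 2]}
+    ];
+
+    let mut validity = Vec::new();
+    let mut offsets = vec![0usize];
+    let mut child_validity = Vec::new();
+    let mut child_values = Vec::new();
+
+    for doc in &docs {
+        validity.push(doc.is_some());
+        if let Some(list) = doc {
+            for element in list {
+                child_validity.push(element.is_some());
+                child_values.push(element.unwrap_or(0)); // placeholder for nulls
+            }
+        }
+        // Null and empty lists add no children, so the offset repeats.
+        offsets.push(child_values.len());
+    }
+
+    assert_eq!(validity, vec![true, false, true, true]);
+    assert_eq!(offsets, vec![0, 1, 1, 1, 3]);
+    assert_eq!(child_validity, vec![true, false, true]);
+    assert_eq!(child_values, vec![1, 0, 2]);
+}
+```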
+
+
+### Repetition Levels
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` would imply a new list in the top-most repeated field, a value of 1 a new element within the top-most repeated field, a value of 2 a new element within the second top-most repeated field, and so on.
+
+Each repeated field also has a corresponding definition level; however, in this case rather than indicating a null value, it indicates an empty array.
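+
+A plain-Rust sketch (illustrative, not the arrow-rs Parquet writer) of the levels an encoder would emit for the single-level list `a` above, where the maximum definition level is 3 and the maximum repetition level is 1:
+
+```rust
+// Sketch: emit (repetition level, definition level, value) triples for the
+// single-level list column "a" from the four example documents.
+fn encode(docs: &[Option<Vec<Option<i32>>>]) -> Vec<(u8, u8, Option<i32>)> {
+    let mut out = Vec::new();
+    for doc in docs {
+        match doc {
+            None => out.push((0, 0, None)), // "a" itself is null
+            Some(list) if list.is_empty() => {
+                out.push((0, 1, None)); // "a" is present but empty
+            }
+            Some(list) => {
+                for (i, element) in list.iter().enumerate() {
+                    // Repetition level 0 starts a new row, 1 continues the list.
+                    let rep = if i == 0 { 0 } else { 1 };
+                    match element {
+                        None => out.push((rep, 2, None)),        // null element
+                        Some(v) => out.push((rep, 3, Some(*v))), // defined element
+                    }
+                }
+            }
+        }
+    }
+    out
+}
+
+fn main() {
+    let docs = vec![
+        Some(vec![Some(1)]),       // {"a": [1]}
+        None,                      // {}
+        Some(vec![]),              // {"a": []}
+        Some(vec![None, Some(2)]), // {"a": [null, 2]}
+    ];
+    assert_eq!(
+        encode(&docs),
+        vec![
+            (0, 3, Some(1)), // value 1
+            (0, 0, None),    // null list
+            (0, 1, None),    // empty list
+            (0, 2, None),    // null element
+            (1, 3, Some(2)), // value 2, continuing the fourth document's list
+        ]
+    );
+}
+```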
+
+
+
+```text
+┌─────────────────────────────────────┐
+│  ┌─────┐      ┌─────┐               │
+│  │  3  │      │  0  │               │
+│  ├─────┤      ├─────┤               │
+│  │  0  │      │  0  │               │
+│  ├─────┤      ├─────┤      ┌─────┐  │
+│  │  1  │      │  1  │      │  1  │  │

Review Comment:
   ```suggestion
   │  │  1  │      │  0  │      │  1  │  │
   ```
   A sanity check you can do is that a repetition level of 0 indicates the start of a new row; therefore, the number of zeros must match the number of rows



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] alamb commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990262123


##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,341 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column, which is a column that contains one or more other columns.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also always provided
+  },
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c", nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d", nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32),
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent-child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                  ┌──────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐   │┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │   ││  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │   ││  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │   ││ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘   │└─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity  │ Values   ││ Validity   Values│ │
+│            │           │  │ │             │          ││                  │
+             │ "c.c1"    │                  │"d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │             │Primitive ││ PrimitiveArray   │
+             │ Array     │                  │Array     ││                  │ │
+│            └───────────┘  │ │             └──────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of d.d2, which contains two nullable levels d and d2.
+
+A definition level of 0 would imply a null at the level of d:
+
+```json
+{
+}
+```
+
+A definition level of 1 would imply a null at the level of d.d2
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of 2 would imply a defined value for d.d2:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the JSON documents above, this format could be stored in this Parquet schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Thus the Parquet encoding of the example would be:
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                 ┌───────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐  │ ┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │  │ │  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤  │ ├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │  │ │  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤  │ ├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │  │ │ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘  │ └─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity │  Values   ││ Validity   Values│ │
+│            │           │  │ │            │           ││                  │
+             │ "c.c1"    │                 │ "d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │            │ Primitive ││ PrimitiveArray   │
+             │ Array     │                 │ Array     ││                  │ │
+│            └───────────┘  │ │            └───────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are columns that contain a variable number of values. For example,
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],  <-- list elements of a are nullable
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+Documents of this format could be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion with a list of monotonically increasing integers called *offsets* in the parent `ListArray`, and stores all the values that appear in the lists in a single child array. Each consecutive pair of elements in this offset array identifies a slice of the child array for that array index.
+
+For example, the list of offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and is therefore a ListArray of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: [child[2]]
+2: []
+```
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
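+
+Reading this diagram back out: the offsets `[0, 1, 1, 1, 3]` give the child slices `0..1`, `1..1`, `1..1` and `1..3`, and combined with the list validity bits `1, 0, 1, 1` they recover the four documents `[1]`, null, `[]` and `[null, 2]`, the final null coming from the child array's own validity mask.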
+
+### Repetition Levels
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of 0 would imply a new list in the top-most repeated field, a value of 1 a new element within the top-most repeated field, a value of 2 a new element within the second top-most repeated field, and so on.
+
+Each repeated field also has a corresponding definition level; however, in this case rather than indicating a null value, it indicates an empty array.
+
+
+
+```text
+┌─────────────────────────────────────┐

Review Comment:
   @tustvold  I don't think this example is correct (I think I mistranslated it / mashed it up incorrectly). Among other issues, it has 5 values in definition/repetition levels but there are only 4 documents 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] alamb commented on a diff in pull request #246: ARROW-17909: [WEBSITE] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r985093767


##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,345 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column, which is a column that contains one or more other columns.
+
+For example consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also always provided
+  },
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c", nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d", nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  5  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐      ┌──────────────────────┐┌──────────────────┐ │
+│ │ ┌─────┐    ┌─────┐   │  │ │ │  ┌─────┐   ┌─────┐   ││ ┌─────┐   ┌─────┐│
+  │ │  0  │    │ ??  │   │      │  │  1  │   │  1  │   ││ │  0  │   │ ??  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  6  │   │      │  │  1  │   │  2  │   ││ │  1  │   │  1  ││ │
+│ │ ├─────┤    ├─────┤   │  │ │ │  ├─────┤   ├─────┤   ││ ├─────┤   ├─────┤│
+  │ │  1  │    │  7  │   │      │  │  0  │   │ ??  │   ││ │ ??  │   │ ??  ││ │
+│ │ └─────┘    └─────┘   │  │ │ │  └─────┘   └─────┘   ││ └─────┘   └─────┘│
+  │ Validity    Values   │      │  Validity   Values   ││ Validity   Values│ │
+│ │                      │  │ │ │                      ││                  │
+  │            "c.c1"    │      │            "d.d1"    ││ "d.d2"           │ │
+│ │            Primitive │  │ │ │            Primitive ││ PrimitiveArray   │
+  │            Array     │      │            Array     ││                  │ │
+│ └──────────────────────┘  │ │ └──────────────────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
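+
+As a concrete (and simplified) illustration, the "a" and "b" columns above could be built with the Rust `arrow` crate roughly as follows; the exact constructors are recalled from memory, so treat this as a sketch rather than the blog's code:
+
+```rust
+use std::sync::Arc;
+use arrow::array::{ArrayRef, Int32Array, StructArray};
+use arrow::datatypes::{DataType, Field};
+
+fn main() {
+    // "a": nullable Int32 column with values 1, 2, null
+    let a = Int32Array::from(vec![Some(1), Some(2), None]);
+
+    // "b": a struct whose children each keep their own validity mask
+    let b1: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(5)]));
+    let b2: ArrayRef = Arc::new(Int32Array::from(vec![3, 4, 6]));
+    let b = StructArray::from(vec![
+        (Field::new("b1", DataType::Int32, true), b1),
+        (Field::new("b2", DataType::Int32, false), b2),
+    ]);
+
+    println!("a: {:?}", a);
+    println!("b: {:?}", b);
+}
+```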
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of `d.d2`, which contains two nullable levels, `d` and `d2`.
+
+A definition level of 0 would imply a null at the level of d:
+
+```json
+{
+}
+```
+
+A definition level of 1 would imply a null at the level of d.d2
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of 2 would imply a defined value for d.d2:
+
+```json
+{
+  d: { d2: .. }
+}
+```
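+
+As a standalone sketch of the encoding direction (plain Rust, not the actual parquet writer; the `D` struct is a hypothetical stand-in for the example's "d" group), the definition level of `d.d2` for one record just walks the two nullable levels:
+
+```rust
+// "d" is a nullable struct; "d1" is required, "d2" is nullable
+#[allow(dead_code)]
+struct D { d1: i32, d2: Option<i32> }
+
+fn def_level_d2(d: &Option<D>) -> i16 {
+    match d {
+        None => 0,                  // null at the level of d
+        Some(d) => match d.d2 {
+            None => 1,              // d is defined but d2 is null
+            Some(_) => 2,           // fully defined
+        },
+    }
+}
+
+fn main() {
+    // The three example records: d={d1:1}, d={d1:2, d2:1}, d is null
+    let records = vec![
+        Some(D { d1: 1, d2: None }),
+        Some(D { d1: 2, d2: Some(1) }),
+        None,
+    ];
+    let levels: Vec<i16> = records.iter().map(def_level_d2).collect();
+    println!("{:?}", levels); // [1, 2, 0]
+}
+```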
+
+Thus the parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  1  │    │  2  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  0  │             │ │  │  5  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐ │   ┌──────────────────────┐ ┌──────────────────────┐  │
+│ │  ┌─────┐    ┌─────┐  │   │ │  ┌─────┐    ┌─────┐  │ │ ┌─────┐     ┌─────┐  │
+  │  │  0  │    │  6  │  │ │   │  │  1  │    │  1  │  │ │ │  1  │     │  1  │  │  │
+│ │  ├─────┤    ├─────┤  │   │ │  ├─────┤    ├─────┤  │ │ ├─────┤     └─────┘  │
+  │  │  1  │    │  7  │  │ │   │  │  1  │    │  2  │  │ │ │  2  │              │  │
+│ │  ├─────┤    └─────┘  │   │ │  ├─────┤    └─────┘  │ │ ├─────┤              │
+  │  │  1  │             │ │   │  │  0  │             │ │ │  0  │              │  │
+│ │  └─────┘             │   │ │  └─────┘             │ │ └─────┘              │
+  │                      │ │   │                      │ │                      │  │
+│ │  Definition   Data   │   │ │  Definition   Data   │ │ Definition   Data    │
+  │    Levels            │ │   │    Levels            │ │   Levels             │  │
+│ │                      │   │ │                      │ │                      │
+  │  "c.1"               │ │   │  "d.1"               │ │  "d.d2"              │  │
+│ └──────────────────────┘   │ └──────────────────────┘ └──────────────────────┘
+     "c"                   │      "d"                                             │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+```
+
+## List / Repeated Columns
+
+Closing out support for nested types are columns containing a variable number of values. For example,
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided
+}
+```
+```json
+{
+  "a": [null, 2],  <-- list elements of a are nullable
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion with a list of monotonically increasing integers called *offsets* in the parent `ListArray`, and stores all the values that appear in the lists in a single child array. Each consecutive pair of elements in this offset array identifies a slice of the child array for that array index.
+
+For example, the list of offsets `[0, 1, 1, 3]` contains 3 pairs of offsets, `(0,1)`, `(1,1)`, and `(1,3)`, and is therefore a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0]]
+1: []
+2: [child[1], child[2]]
+```
+
+For the example above, this would be encoded in Arrow as
+
+```text
+a: ListArray
+  Offsets: [0, 1, 1, 3]
+  Validity: [true, false, true]
+  Children:
+    element: PrimitiveArray
+      Buffer[0]: [1, 2, ARBITRARY]
+      Validity: [true, true, false]
+```
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    └─────┘   ├─────┤    │ └─────┘   └─────┘│
+     Validity  │  3  │    │ Validity   Values│ │
+│              └─────┘    │                  │
+                          │ child[0]         │ │
+│                         │ PrimitiveArray   │
+               Offsets    │                  │ │
+│                         └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+### Repetition Levels
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of 0 would imply a new list in the top-most repeated field, a value of 1 a new element within the top-most repeated field, a value of 2 a new element within the second top-most repeated field, and so on.
+
+Each repeated field also has a corresponding definition level; however, in this case, rather than indicating a null value, it indicates an empty array.
+
+
+```text
+a:
+  Data Page:
+    Repetition Levels: encode([0, 0, 0, 1])
+    Definition Levels: encode([3, 0, 2, 2])

Review Comment:
   I am not sure this definition level is correct for this example





[GitHub] [arrow-site] tustvold commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990621001


##########
_posts/2022-10-07-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {       <-- b is always provided (not nullable)
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not nullable)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also
+  },               always provided (not nullable)
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in an Arrow `StructArray` with this schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
+│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
+   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
+│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
+   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
+│           │           │ │ │            │          ││                 │
+            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
+│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
+            │ Array     │                │Array     ││                 │ │
+│           └───────────┘ │ │            └──────────┘└─────────────────┘
+    "c"                         "d"                                      │
+│   StructArray           │ │   StructArray
+  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d.d2`, i.e. `d` is defined but `d2` is null
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the three JSON documents above, they could be stored in Parquet with this schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+The Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
+│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
+  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
+│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
+  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
+│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
+  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
+│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
+  │                    │ │   │                    │ │                  │ │
+│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
+  │    Levels          │ │   │    Levels          │ │   Levels         │ │
+│ │                    │   │ │                    │ │                  │
+  │  "c.1"             │ │   │  "d.1"             │ │  "d.d2"          │ │
+│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
+     "c"                 │      "d"                                      │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
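+
+Reading works in the opposite direction: from the definition levels and the compacted data for a leaf column, a reader can rebuild the nullable values. A minimal standalone sketch for the `d.d2` column of the example (illustrative only, not the parquet crate's reader):
+
+```rust
+fn main() {
+    // Definition levels and data for "d.d2" from the figure above
+    let def_levels: Vec<i16> = vec![1, 2, 0];
+    let data = vec![1_i32];
+    let max_def_level = 2; // a value is stored only when the path is fully defined
+
+    let mut remaining = data.iter();
+    let rebuilt: Vec<Option<i32>> = def_levels
+        .iter()
+        .map(|&level| {
+            if level == max_def_level {
+                remaining.next().copied() // take the next stored value
+            } else {
+                None // level 0: "d" was null, level 1: "d2" was null
+            }
+        })
+        .collect();
+
+    println!("{:?}", rebuilt); // [None, Some(1), None]
+}
+```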
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- "a" has a null and non-null elements
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion as a `ListArray`. A `ListArray` contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the ListArray
+
+For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: [child[2]]
+2: []
+```
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
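+
+For reference, the same four documents can be constructed as a `ListArray` with the Rust `arrow` crate, which derives the offsets shown above automatically. The constructor and accessor names below are recalled from memory, so double-check them against the crate documentation:
+
+```rust
+use arrow::array::ListArray;
+use arrow::datatypes::Int32Type;
+
+fn main() {
+    // The four example documents: [1], null, [], [null, 2]
+    let docs = vec![
+        Some(vec![Some(1)]),
+        None,
+        Some(vec![]),
+        Some(vec![None, Some(2)]),
+    ];
+    let a = ListArray::from_iter_primitive::<Int32Type, _, _>(docs);
+
+    // Expect offsets [0, 1, 1, 1, 3] and a child array of [1, null, 2]
+    println!("offsets: {:?}", a.value_offsets());
+    println!("values:  {:?}", a.values());
+}
+```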
+
+
+### Parquet Repetition Levels
+
+The example above with 4 JSON documents can be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` means a new list in the top-most repeated list, a value of `1` means a new element within the top-most repeated list, a value of `2` means a new element within the second top-most repeated list, and so on.
+
+*Protip*: for the topmost level list, the number of zeros in the `repetition` levels must match the number of rows.

Review Comment:
   ```suggestion
   A consequence of this encoding is that the number of zeros in the `repetition` levels must match the number of rows in the column, and the first level must be 0.
   ```
   
   I think this avoids suggesting levels in some way are associated with the list column, and not just the leaf primitives. It also makes clear it is a consequence of the definition above, and not a suggestion :sweat_smile: 





[GitHub] [arrow-site] alamb commented on pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #246:
URL: https://github.com/apache/arrow-site/pull/246#issuecomment-1272292180

   I made some last minute typo things. Let's do this




[GitHub] [arrow-site] tustvold commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r988309984


##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,341 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in-memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column, which is a column that contains one or more other columns.
+
+For example consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also always provided
+  },
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                  ┌──────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐   │┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │   ││  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │   ││  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │   ││ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘   │└─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity  │ Values   ││ Validity   Values│ │
+│            │           │  │ │             │          ││                  │
+             │ "c.c1"    │                  │"d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │             │Primitive ││ PrimitiveArray   │
+             │ Array     │                  │Array     ││                  │ │
+│            └───────────┘  │ │             └──────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example, consider the case of `d.d2`, which contains two nullable levels, `d` and `d2`.
+
+A definition level of 0 would imply a null at the level of d:
+
+```json
+{
+}
+```
+
+A definition level of 1 would imply a null at the level of d.d2
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of 2 would imply a defined value for d.d2:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the JSON documents above, this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Thus the parquet encoding of the example would be:
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                 ┌───────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐  │ ┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │  │ │  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤  │ ├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │  │ │  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤  │ ├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │  │ │ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘  │ └─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity │  Values   ││ Validity   Values│ │
+│            │           │  │ │            │           ││                  │
+             │ "c.c1"    │                 │ "d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │            │ Primitive ││ PrimitiveArray   │
+             │ Array     │                 │ Array     ││                  │ │
+│            └───────────┘  │ │            └───────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are columns containing a variable number of values. For example,
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],  <-- list elements of a are nullable
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion with a list of monotonically increasing integers called *offsets* in the parent `ListArray`, and stores all the values that appear in the lists in a single child array. Each consecutive pair of elements in this offset array identifies a slice of the child array for that array index.
+
+For example, the list of offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and is therefore a ListArray of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: [child[2]]
+2: []
+```
+
+For the example above with 4 JSON documents, this would be encoded in arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+### Repetition Levels
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of 0 would imply a new list in the top-most repeated field, a value of 1 a new element within the top-most repeated field, a value of 2 a new element within the second top-most repeated field, and so on.
+
+Each repeated field also has a corresponding definition level; however, in this case, rather than indicating a null value, it indicates an empty array.
+
+
+
+```text
+┌─────────────────────────────────────┐

Review Comment:
   I will review this first thing tomorrow





[GitHub] [arrow-site] tustvold commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990622384


##########
_posts/2022-10-07-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three-part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {       <-- b is always provided (not nullable)
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not nullable)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also
+  },               always provided (not nullable)
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in an Arrow `StructArray` with this schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
+│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
+   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
+│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
+   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
+│           │           │ │ │            │          ││                 │
+            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
+│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
+            │ Array     │                │Array     ││                 │ │
+│           └───────────┘ │ │            └──────────┘└─────────────────┘
+    "c"                         "d"                                      │
+│   StructArray           │ │   StructArray
+  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d.d2`, i.e. `d` is defined but `d2` is null
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the three JSON documents above, they could be stored in Parquet with this schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+The Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
+│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
+  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
+│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
+  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
+│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
+  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
+│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
+  │                    │ │   │                    │ │                  │ │
+│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
+  │    Levels          │ │   │    Levels          │ │   Levels         │ │
+│ │                    │   │ │                    │ │                  │
+  │  "c.1"             │ │   │  "d.1"             │ │  "d.d2"          │ │
+│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
+     "c"                 │      "d"                                      │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- "a" has a null and non-null elements
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion as a `ListArray`. A `ListArray` contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the ListArray
+
+For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: [child[2]]
+2: []
+```
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
+
+
+### Parquet Repetition Levels
+
+The example above with 4 JSON documents can be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` means a new list in the top-most repeated list, a value of `1` means a new element within the top-most repeated list, a value of `2` means a new element within the second top-most repeated list, and so on.
+
+*Protip*: for the topmost level list, the number of zeros in the `repetition` levels must match the number of rows.
+
+Each repeated field also has a corresponding definition level; however, in this case, rather than indicating a null value, it indicates an empty array.
+
+
+```text
+┌─────────────────────────────────────┐
+│  ┌─────┐      ┌─────┐               │
+│  │  3  │      │  0  │               │
+│  ├─────┤      ├─────┤               │
+│  │  0  │      │  0  │               │
+│  ├─────┤      ├─────┤      ┌─────┐  │
+│  │  1  │      │  0  │      │  1  │  │
+│  ├─────┤      ├─────┤      ├─────┤  │
+│  │  2  │      │  0  │      │  2  │  │
+│  ├─────┤      ├─────┤      └─────┘  │
+│  │  3  │      │  1  │               │
+│  └─────┘      └─────┘               │
+│                                     │
+│ Definition  Repetition      Values  │
+│   Levels      Levels                │
+│  "a"                                │
+│                                     │
+└─────────────────────────────────────┘
+```
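+
+The same shredding can be written out as a small sketch in plain Rust, hard-coded to this particular two-level schema (a real Parquet writer implements the general algorithm for arbitrary nesting):
+
+```rust
+/// Compute Parquet (definition level, repetition level) pairs and the value
+/// data for the schema
+///   optional group a (LIST) { repeated group list { optional int32 element; } }
+/// where the maximum definition level is 3.
+fn shred(rows: &[Option<Vec<Option<i32>>>]) -> (Vec<i16>, Vec<i16>, Vec<i32>) {
+    let (mut def, mut rep, mut values) = (Vec::new(), Vec::new(), Vec::new());
+    for row in rows {
+        match row {
+            None => {                          // "a" itself is null
+                def.push(0);
+                rep.push(0);
+            }
+            Some(list) if list.is_empty() => { // "a" is defined but empty
+                def.push(1);
+                rep.push(0);
+            }
+            Some(list) => {
+                for (i, item) in list.iter().enumerate() {
+                    // 0 starts a new top-most list, 1 continues the current one
+                    rep.push(if i == 0 { 0 } else { 1 });
+                    match item {
+                        None => def.push(2),   // null element
+                        Some(v) => {
+                            def.push(3);       // fully defined value
+                            values.push(*v);
+                        }
+                    }
+                }
+            }
+        }
+    }
+    (def, rep, values)
+}
+
+fn main() {
+    let rows = vec![
+        Some(vec![Some(1)]),       // {"a": [1]}
+        None,                      // {}
+        Some(vec![]),              // {"a": []}
+        Some(vec![None, Some(2)]), // {"a": [null, 2]}
+    ];
+    assert_eq!(shred(&rows), (vec![3, 0, 1, 2, 3], vec![0, 0, 0, 0, 1], vec![1, 2]));
+}
+```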
+
+
+
+## Next up: Arbitrary Nesting: Lists of Structs and Structs of Lists
+
+In our final blog post <!-- When published, add link here --> we will explain how Parquet and Arrow combine these concepts to support arbitrary nesting of potentially nullable data structures.

Review Comment:
   ```suggestion
   In our final blog post <!-- When published, add link here --> we will explain how Parquet and Arrow combine these concepts to support arbitrary nesting of potentially nullable data structures. 
   
   If you just want to get stuck in with the code, you will be pleased to hear that with the Rust [parquet](https://crates.io/crates/parquet) implementation, reading and writing nested data to and from Arrow is as simple as reading unnested data, with all the complex record shredding handled automatically for you. With this and other exciting features, such as out-of-the-box support for [reading asynchronously](https://docs.rs/parquet/22.0.0/parquet/arrow/async_reader/index.html) from [object storage](https://docs.rs/object_store/0.5.0/object_store/) and advanced row filter pushdown (blog post to follow), it is the fastest and most feature-complete Rust parquet implementation. We look forward to seeing what you build with it!
   ```
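   
   For reference, with the crate's `arrow` feature enabled, reading a file containing nested columns looks exactly like reading a flat one; a minimal sketch (the file name and error handling here are illustrative only):
   
   ```rust
   use std::fs::File;
   use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
   
   fn main() -> Result<(), Box<dyn std::error::Error>> {
       // The reader performs the record shredding / reassembly described in
       // the post internally; nested columns need no special handling here.
       let file = File::open("nested.parquet")?;
       let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
       for batch in reader {
           let batch = batch?;
           println!("read {} rows with schema {:?}", batch.num_rows(), batch.schema());
       }
       Ok(())
   }
   ```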



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] tustvold commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990621001


##########
_posts/2022-10-07-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {       <-- b is always provided (not nullable)
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not nullable)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also
+  },               always provided (not nullable)
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in an Arrow `StructArray` with this schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent-child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
+│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
+   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
+│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
+   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
+│           │           │ │ │            │          ││                 │
+            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
+│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
+            │ Array     │                │Array     ││                 │ │
+│           └───────────┘ │ │            └──────────┘└─────────────────┘
+    "c"                         "d"                                      │
+│   StructArray           │ │   StructArray
+  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
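+
+As a rough sketch of this layout, the `"b"` column from the example can be built up in plain Rust; the struct and field handling below are simplified for illustration and are not the actual arrow-rs builders:
+
+```rust
+/// Simplified buffers for the "b" struct column: "b1" is nullable and so
+/// carries a validity mask; "b2" is non-nullable and stores values only.
+/// The struct stores no extra data of its own because "b" is not nullable.
+struct StructB {
+    b1_validity: Vec<bool>,
+    b1_values: Vec<i32>, // slots where "b1" is null hold an arbitrary placeholder
+    b2_values: Vec<i32>,
+}
+
+fn build_b(rows: &[(Option<i32>, i32)]) -> StructB {
+    let mut b = StructB { b1_validity: vec![], b1_values: vec![], b2_values: vec![] };
+    for (b1, b2) in rows {
+        b.b1_validity.push(b1.is_some());
+        b.b1_values.push(b1.unwrap_or(0));
+        b.b2_values.push(*b2);
+    }
+    b
+}
+
+fn main() {
+    // The three example records: (b1, b2) = (1, 3), (null, 4), (5, 6)
+    let b = build_b(&[(Some(1), 3), (None, 4), (Some(5), 6)]);
+    assert_eq!(b.b1_validity, vec![true, false, true]);
+    assert_eq!(b.b1_values, vec![1, 0, 5]);
+    assert_eq!(b.b2_values, vec![3, 4, 6]);
+}
+```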
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d2`:
+
+```json
+{
+  d: { null }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
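+
+Treating the `d.d2` column as an `Option<Option<i32>>` purely for illustration, this mapping can be written as a tiny sketch:
+
+```rust
+/// Definition level for the nested optional column "d.d2": both "d" and
+/// "d2" are nullable, so the maximum definition level is 2.
+fn def_level(d: &Option<Option<i32>>) -> i16 {
+    match d {
+        None => 0,          // "d" itself is null
+        Some(None) => 1,    // "d" is present but "d2" is null
+        Some(Some(_)) => 2, // "d.d2" is fully defined
+    }
+}
+
+fn main() {
+    assert_eq!(def_level(&None), 0);
+    assert_eq!(def_level(&Some(None)), 1);
+    assert_eq!(def_level(&Some(Some(1))), 2);
+}
+```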
+
+
+Going back to the three JSON documents above, they could be stored in Parquet with this schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+The Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
+│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
+  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
+│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
+  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
+│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
+  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
+│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
+  │                    │ │   │                    │ │                  │ │
+│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
+  │    Levels          │ │   │    Levels          │ │   Levels         │ │
+│ │                    │   │ │                    │ │                  │
+  │  "c.1"             │ │   │  "d.1"             │ │  "d.d2"          │ │
+│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
+     "c"                 │      "d"                                      │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- "a" has a null and non-null elements
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+)
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion as a `ListArray`. A `ListArray` contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the ListArray
+
+For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: []
+2: [child[2]]
+```
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
+
+
+### Parquet Repetition Levels
+
+The example above with 4 JSON documents can be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` means a new list in the top-most repeated list, a value of `1` means a new element within the top-most repeated list, a value of `2` means a new element within the second top-most repeated list, and so on.
+
+*Protip*: for the topmost level list, the number of zeros in the `repetition` levels must match the number of rows.

Review Comment:
   ```suggestion
   A consequence of this encoding is that the number of zeros in the `repetition` levels is the total number of rows in the column, and the first level in a column must be 0.
   ```
   
   I think this avoids suggesting levels in some way are associated with the list column, and not just the leaf primitives. It also makes clear it is a consequence of the definition above, and not a suggestion :sweat_smile: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] tustvold commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990621001


##########
_posts/2022-10-07-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {       <-- b is always provided (not nullable)
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not nullable)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also
+  },               always provided (not nullable)
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in an Arrow `StructArray` with this schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
+│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
+   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
+│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
+   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
+│           │           │ │ │            │          ││                 │
+            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
+│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
+            │ Array     │                │Array     ││                 │ │
+│           └───────────┘ │ │            └──────────┘└─────────────────┘
+    "c"                         "d"                                      │
+│   StructArray           │ │   StructArray
+  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element, is the depth in the schema at which it is fully defined.
+
+For example consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d2`:
+
+```json
+{
+  d: { null }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the three JSON documents above, they could be stored in Parquet with this schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+The Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
+│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
+  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
+│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
+  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
+│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
+  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
+│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
+  │                    │ │   │                    │ │                  │ │
+│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
+  │    Levels          │ │   │    Levels          │ │   Levels         │ │
+│ │                    │   │ │                    │ │                  │
+  │  "c.1"             │ │   │  "d.1"             │ │  "d.d2"          │ │
+│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
+     "c"                 │      "d"                                      │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- "a" has a null and non-null elements
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+)
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion as a `ListArray`. A `ListArray` contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the ListArray
+
+For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: []
+2: [child[2]]
+```
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
+
+
+### Parquet Repetition Levels
+
+The example above with 4 JSON documents can be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` means a new list in the top-most repeated list, a value of `1` means a new element within the top-most repeated list, a value of `2` means a new element within the second top-most repeated list, and so on.
+
+*Protip*: for the topmost level list, the number of zeros in the `repetition` levels must match the number of rows.

Review Comment:
   ```suggestion
   A consequence of this encoding is that the number of zeros in the `repetition` levels is the total number of records in the column, and the first level in a column must be 0.
   ```
   
   I think this avoids suggesting levels in some way are associated with the list column, and not just the leaf primitives. It also makes clear it is a consequence of the definition above, and not a suggestion :sweat_smile: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] tustvold commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990621001


##########
_posts/2022-10-07-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {       <-- b is always provided (not nullable)
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not nullable)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also
+  },               always provided (not nullable)
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in an Arrow `StructArray` with this schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
+│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
+   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
+│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
+   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
+│           │           │ │ │            │          ││                 │
+            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
+│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
+            │ Array     │                │Array     ││                 │ │
+│           └───────────┘ │ │            └──────────┘└─────────────────┘
+    "c"                         "d"                                      │
+│   StructArray           │ │   StructArray
+  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element, is the depth in the schema at which it is fully defined.
+
+For example consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d2`:
+
+```json
+{
+  d: { null }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the three JSON documents above, they could be stored in Parquet with this schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+The Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
+│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
+  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
+│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
+  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
+│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
+  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
+│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
+  │                    │ │   │                    │ │                  │ │
+│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
+  │    Levels          │ │   │    Levels          │ │   Levels         │ │
+│ │                    │   │ │                    │ │                  │
+  │  "c.1"             │ │   │  "d.1"             │ │  "d.d2"          │ │
+│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
+     "c"                 │      "d"                                      │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- "a" has a null and non-null elements
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+)
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion as a `ListArray`. A `ListArray` contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the ListArray
+
+For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: []
+2: [child[2]]
+```
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
+
+
+### Parquet Repetition Levels
+
+The example above with 4 JSON documents can be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` means a new list in the top-most repeated list, a value of `1` means a new element within the top-most repeated list, a value of `2` means a new element within the second top-most repeated list, and so on.
+
+*Protip*: for the topmost level list, the number of zeros in the `repetition` levels must match the number of rows.

Review Comment:
   ```suggestion
   A consequence of this encoding is that the number of zeros in the `repetition` levels must match the number of rows in the column, and the first level in a column must be 0.
   ```
   
   I think this avoids suggesting levels in some way are associated with the list column, and not just the leaf primitives. It also makes clear it is a consequence of the definition above, and not a suggestion :sweat_smile: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] alamb commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990623570


##########
_posts/2022-10-07-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {       <-- b is always provided (not nullable)
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not nullable)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also
+  },               always provided (not nullable)
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in an Arrow `StructArray` with this schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
+│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
+   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
+│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
+   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
+│           │           │ │ │            │          ││                 │
+            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
+│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
+            │ Array     │                │Array     ││                 │ │
+│           └───────────┘ │ │            └──────────┘└─────────────────┘
+    "c"                         "d"                                      │
+│   StructArray           │ │   StructArray
+  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element, is the depth in the schema at which it is fully defined.
+
+For example consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d2`:
+
+```json
+{
+  d: { null }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the three JSON documents above, they could be stored in Parquet with this schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+The Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
+│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
+  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
+│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
+  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
+│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
+  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
+│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
+  │                    │ │   │                    │ │                  │ │
+│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
+  │    Levels          │ │   │    Levels          │ │   Levels         │ │
+│ │                    │   │ │                    │ │                  │
+  │  "c.1"             │ │   │  "d.1"             │ │  "d.d2"          │ │
+│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
+     "c"                 │      "d"                                      │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- "a" has a null and non-null elements
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion as a `ListArray`. A `ListArray` contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the ListArray
+
+For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: []
+2: [child[2]]
+```
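+
+As a quick illustration, the slicing implied by each pair of offsets can be written directly in a few lines of Python (a sketch only; the `child` placeholder strings stand in for whatever the child array holds):
+
+```python
+offsets = [0, 2, 3, 3]
+child = ["child[0]", "child[1]", "child[2]"]
+
+# each consecutive pair (start, end) selects one list's elements
+lists = [child[start:end] for start, end in zip(offsets, offsets[1:])]
+print(lists)  # [['child[0]', 'child[1]'], [], ['child[2]']]
+```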
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
+
+
+### Parquet Repetition Levels
+
+The example above with 4 JSON documents can be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` means a new list in the top-most repeated list, a value of `1` means a new element within the top-most repeated list, a value of `2` means a new element within the second top-most repeated list, and so on.
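+
+A rough Python sketch (not the actual arrow-rs writer; `docs` is just the example data, with `None` standing in for a record where `a` is null) shows how these repetition levels could be produced for the single-level list column `a`:
+
+```python
+docs = [[1], None, [], [None, 2]]
+
+def repetition_levels(lists):
+    levels = []
+    for lst in lists:
+        if not lst:           # a null or empty list still occupies one entry
+            levels.append(0)
+            continue
+        levels.append(0)                   # first element starts a new row
+        levels.extend(1 for _ in lst[1:])  # later elements stay in the same list
+    return levels
+
+levels = repetition_levels(docs)
+print(levels)                        # [0, 0, 0, 0, 1]
+print(levels.count(0) == len(docs))  # True: one zero per row
+```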
+
+*Protip*: for the topmost level list, the number of zeros in the `repetition` levels must match the number of rows.

Review Comment:
   I will also try and refrain from changing the style from a more formal tone with things like `*protip*` lol





[GitHub] [arrow-site] alamb commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r988901269


##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists"
+date: "2022-10-01 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. The first post <!-- todo add link when published --> covers the basics of data storage and validity encoding, and this post covers `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also always provided
+  },
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                  ┌──────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐   │┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │   ││  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │   ││  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │   ││ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘   │└─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity  │ Values   ││ Validity   Values│ │
+│            │           │  │ │             │          ││                  │
+             │ "c.c1"    │                  │"d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │             │Primitive ││ PrimitiveArray   │
+             │ Array     │                  │Array     ││                  │ │
+│            └───────────┘  │ │             └──────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
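+
+A hand-rolled sketch of this parent/child idea for the nullable struct column `d` might look as follows (plain Python dictionaries rather than real Arrow buffers; the `get` helper is purely illustrative). The key point is that every child keeps one slot per record, and slots under a null parent (the `??` entries above) may hold arbitrary values because the validity masks say to ignore them:
+
+```python
+struct_d = {
+    "validity": [1, 1, 0],               # "d" is null in the third record
+    "children": {
+        "d1": {"validity": None,         # d1 is not nullable within "d"
+               "values": [1, 2, 0]},     # last slot is masked out by the parent
+        "d2": {"validity": [0, 1, 0],    # d2 is null in the first record
+               "values": [0, 1, 0]},
+    },
+}
+
+def get(struct, child, i):
+    """Return the value of struct.child for record i, or None if masked."""
+    if not struct["validity"][i]:
+        return None
+    c = struct["children"][child]
+    if c["validity"] is not None and not c["validity"][i]:
+        return None
+    return c["values"][i]
+
+print([get(struct_d, "d2", i) for i in range(3)])  # [None, 1, None]
+```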
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead storing only definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d.d2`
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
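+
+Put another way, each possible definition level for `d.d2` corresponds to one of the record shapes above (a small illustrative mapping, with `...` standing in for any concrete value):
+
+```python
+shapes_by_definition_level = {
+    0: {},                    # "d" itself is null
+    1: {"d": {}},             # "d" is present, "d2" is null
+    2: {"d": {"d2": "..."}},  # both are present
+}
+
+for level, shape in shapes_by_definition_level.items():
+    print(level, shape)
+```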
+
+
+Going back to the JSON documents above, this format could be stored in this Parquet schema:
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Thus the Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌──────────────────────┐ │   ┌──────────────────────┐ ┌──────────────────────┐  │
+│ │  ┌─────┐    ┌─────┐  │   │ │  ┌─────┐    ┌─────┐  │ │ ┌─────┐     ┌─────┐  │
+  │  │  0  │    │  6  │  │ │   │  │  1  │    │  1  │  │ │ │  1  │     │  1  │  │  │
+│ │  ├─────┤    ├─────┤  │   │ │  ├─────┤    ├─────┤  │ │ ├─────┤     └─────┘  │
+  │  │  1  │    │  7  │  │ │   │  │  1  │    │  2  │  │ │ │  2  │              │  │
+│ │  ├─────┤    └─────┘  │   │ │  ├─────┤    └─────┘  │ │ ├─────┤              │
+  │  │  1  │             │ │   │  │  0  │             │ │ │  0  │              │  │
+│ │  └─────┘             │   │ │  └─────┘             │ │ └─────┘              │
+  │                      │ │   │                      │ │                      │  │
+│ │  Definition   Data   │   │ │  Definition   Data   │ │ Definition   Data    │
+  │    Levels            │ │   │    Levels            │ │   Levels             │  │
+│ │                      │   │ │                      │ │                      │
+  │  "c.1"               │ │   │  "d.1"               │ │  "d.d2"              │  │
+│ └──────────────────────┘   │ └──────────────────────┘ └──────────────────────┘
+     "c"                   │      "d"                                             │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example,
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- list elements can themselves be null
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+Documents of this format could be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion with a list of monotonically increasing integers called *offsets* in a `ListArray`, and stores all the values that appear in the lists in a single child array. Each consecutive pair of elements in this offset array identifies a slice of the child array for that array index
+
+For example, the list of offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and is therefore a ListArray of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: []
+2: [child[2]]
+```
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
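+
+Building those buffers from the four documents can be sketched in a few lines of plain Python (not the arrow-rs builder; a real child array would carry its own validity mask rather than Python `None`s):
+
+```python
+docs = [[1], None, [], [None, 2]]
+
+validity, offsets, child = [], [0], []
+for lst in docs:
+    validity.append(0 if lst is None else 1)
+    if lst is not None:
+        child.extend(lst)
+    offsets.append(len(child))  # a null or empty list adds no child values
+
+print(validity)  # [1, 0, 1, 1]
+print(offsets)   # [0, 1, 1, 1, 3]
+print(child)     # [1, None, 2]
+```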
+
+
+### Repetition Levels
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` would imply a new list in the top-most repeated field, a value of `1` a new element within the top-most repeated field, a value of `2` a new element within the second top-most repeated field, and so on.
+
+Each repeated field also has a corresponding definition level; however, in this case, rather than indicating a null value, it indicates an empty array.
+
+
+
+```text
+┌─────────────────────────────────────┐
+│  ┌─────┐      ┌─────┐               │
+│  │  3  │      │  0  │               │
+│  ├─────┤      ├─────┤               │
+│  │  0  │      │  0  │               │
+│  ├─────┤      ├─────┤      ┌─────┐  │
+│  │  1  │      │  1  │      │  1  │  │

Review Comment:
   That is a great tip -- I will also add it to the text. 





[GitHub] [arrow-site] alamb commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990262884


##########
_posts/2022-10-07-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {       <-- b is always provided (not nullable)
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not nullable)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also
+  },               always provided (not nullable)
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in an Arrow `StructArray` with this schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
+│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
+   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
+│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
+   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
+│           │           │ │ │            │          ││                 │
+            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
+│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
+            │ Array     │                │Array     ││                 │ │
+│           └───────────┘ │ │            └──────────┘└─────────────────┘
+    "c"                         "d"                                      │
+│   StructArray           │ │   StructArray
+  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead storing only definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d.d2`
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the three JSON documents above, they could be stored in Parquet with this schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+The Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
+│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
+  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
+│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
+  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
+│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
+  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
+│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
+  │                    │ │   │                    │ │                  │ │
+│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
+  │    Levels          │ │   │    Levels          │ │   Levels         │ │
+│ │                    │   │ │                    │ │                  │
+  │  "c.1"             │ │   │  "d.1"             │ │  "d.d2"          │ │
+│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
+     "c"                 │      "d"                                      │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
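+
+Reading such a column back means expanding the dense data using its definition levels. A minimal sketch of that expansion (ignoring *which* ancestor was null, something a real reader also has to track) is:
+
+```python
+def expand(def_levels, values, max_def):
+    """Re-insert nulls: only entries at the maximum definition level carry data."""
+    it = iter(values)
+    return [next(it) if level == max_def else None for level in def_levels]
+
+print(expand([1, 0, 1], [1, 5], max_def=1))  # "b.b1" -> [1, None, 5]
+print(expand([1, 2, 0], [1], max_def=2))     # "d.d2" -> [None, 1, None]
+```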
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- "a" has a null and non-null elements
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion as a `ListArray`. A `ListArray` contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the ListArray
+
+For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: []
+2: [child[2]]
+```
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
+
+
+### Parquet Repetition Levels
+
+The example above with 4 JSON documents can be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` means a new list in the top-most repeated list, a value of `1` means a new element within the top-most repeated list, a value of `2` means a new element within the second top-most repeated list, and so on.
+
+*Protip*: for the topmost level list, the number of zeros in the `repetition` levels must match the number of rows.
+
+Each repeated field also has a corresponding definition level; however, in this case, rather than indicating a null value, it indicates an empty array.
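+
+An end-to-end sketch tying the two together for the list column `a` (again plain Python rather than the arrow-rs writer; the maximum definition level here is 3 because `a`, the repeated `list` group, and the optional `element` each contribute one):
+
+```python
+docs = [[1], None, [], [None, 2]]
+
+def_levels, rep_levels, values = [], [], []
+for lst in docs:
+    if lst is None:        # "a" itself is null
+        def_levels.append(0)
+        rep_levels.append(0)
+    elif not lst:          # "a" is an empty list
+        def_levels.append(1)
+        rep_levels.append(0)
+    else:
+        for i, element in enumerate(lst):
+            rep_levels.append(0 if i == 0 else 1)
+            if element is None:          # element slot exists but is null
+                def_levels.append(2)
+            else:
+                def_levels.append(3)
+                values.append(element)
+
+print(def_levels)  # [3, 0, 1, 2, 3]
+print(rep_levels)  # [0, 0, 0, 0, 1]
+print(values)      # [1, 2]
+```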
+
+
+```text
+┌─────────────────────────────────────┐
+│  ┌─────┐      ┌─────┐               │
+│  │  3  │      │  0  │               │
+│  ├─────┤      ├─────┤               │
+│  │  0  │      │  0  │               │
+│  ├─────┤      ├─────┤      ┌─────┐  │
+│  │  1  │      │  0  │      │  1  │  │
+│  ├─────┤      ├─────┤      ├─────┤  │
+│  │  2  │      │  0  │      │  2  │  │
+│  ├─────┤      ├─────┤      └─────┘  │
+│  │  3  │      │  1  │               │
+│  └─────┘      └─────┘               │

Review Comment:
   @tustvold I don't think this example is correct (I think I mis translated it / mashed it up incorrectly). Among other issues, it has 5 values in definition/repetition levels but there are only 4 documents 🤔





[GitHub] [arrow-site] alamb commented on pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #246:
URL: https://github.com/apache/arrow-site/pull/246#issuecomment-1264605322

   > Some mistakes, likely mine 😅 
   
   It is very kind of you to provide such an out for me -- lol




[GitHub] [arrow-site] iravid commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
iravid commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r985506018


##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,341 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column, this is a column that contains one or more other columns.
+
+For example consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also always provided
+  },
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                  ┌──────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐   │┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │   ││  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │   ││  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │   ││ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘   │└─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity  │ Values   ││ Validity   Values│ │
+│            │           │  │ │             │          ││                  │
+             │ "c.c1"    │                  │"d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │             │Primitive ││ PrimitiveArray   │
+             │ Array     │                  │Array     ││                  │ │
+│            └───────────┘  │ │             └──────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element, is the depth in the schema at which it is fully defined.
+
+For example consider the case of d.d2, which contains two nullable levels d and d2.
+
+A definition level of 0 would imply a null at the level of d:
+
+```json
+{
+}
+```
+
+A definition level of 1 would imply a null at the level of d.d2
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of 2 would imply a defined value for d.d2:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Goin back to the JSON documents above, this format could be stored in this parquet schema

Review Comment:
   ```suggestion
   Goin back to the JSON documents above, this format could be stored in this Parquet schema
   ```



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,341 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column, this is a column that contains one or more other columns.
+
+For example consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also always provided
+  },
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                  ┌──────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐   │┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │   ││  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │   ││  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │   ││ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘   │└─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity  │ Values   ││ Validity   Values│ │
+│            │           │  │ │             │          ││                  │
+             │ "c.c1"    │                  │"d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │             │Primitive ││ PrimitiveArray   │
+             │ Array     │                  │Array     ││                  │ │
+│            └───────────┘  │ │             └──────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element, is the depth in the schema at which it is fully defined.
+
+For example consider the case of d.d2, which contains two nullable levels d and d2.
+
+A definition level of 0 would imply a null at the level of d:
+
+```json
+{
+}
+```
+
+A definition level of 1 would imply a null at the level of d.d2
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of 2 would imply a defined value for d.d2:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Goin back to the JSON documents above, this format could be stored in this parquet schema

Review Comment:
   ```suggestion
   Going back to the JSON documents above, this format could be stored in this Parquet schema
   ```



##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,341 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column, this is a column that contains one or more other columns.
+
+For example consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also always provided
+  },
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                  ┌──────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐   │┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │   ││  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │   ││  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │   ││ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘   │└─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity  │ Values   ││ Validity   Values│ │
+│            │           │  │ │             │          ││                  │
+             │ "c.c1"    │                  │"d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │             │Primitive ││ PrimitiveArray   │
+             │ Array     │                  │Array     ││                  │ │
+│            └───────────┘  │ │             └──────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element, is the depth in the schema at which it is fully defined.
+
+For example consider the case of d.d2, which contains two nullable levels d and d2.
+
+A definition level of 0 would imply a null at the level of d:
+
+```json
+{
+}
+```
+
+A definition level of 1 would imply a null at the level of d.d2
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of 2 would imply a defined value for d.d2:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Goin back to the JSON documents above, this format could be stored in this parquet schema

Review Comment:
   ```suggestion
   Going back to the JSON documents above, this format could be stored in this Parquet schema:
   ```





[GitHub] [arrow-site] alamb commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r985219559


##########
_posts/2022-10-01-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,341 @@
+---
+layout: post
+title: Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists
+date: "2022-10-01 00:00:00"
+author: tustvold, alamb
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second, in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) for in memory processing and [Apache Parquet](https://parquet.apache.org/) for efficient storage. This post covers `Struct` and `List` types.
+
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a struct column, which is a column that contains one or more other columns.
+
+For example consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not null)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also always provided
+  },
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in this arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
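+
+The same schema can also be written programmatically with the Rust arrow crate. The snippet below is only a sketch: the exact constructors (for example, how `DataType::Struct` receives its child fields) differ between arrow versions.
+
+```rust
+use arrow::datatypes::{DataType, Field, Schema};
+
+fn main() {
+    // Nullability of each Field mirrors the schema above
+    let schema = Schema::new(vec![
+        Field::new("a", DataType::Int32, true),
+        Field::new("b", DataType::Struct(vec![
+            Field::new("b1", DataType::Int32, true),
+            Field::new("b2", DataType::Int32, false),
+        ]), false),
+        Field::new("c", DataType::Struct(vec![
+            Field::new("c1", DataType::Int32, false),
+        ]), true),
+        Field::new("d", DataType::Struct(vec![
+            Field::new("d1", DataType::Int32, false),
+            Field::new("d2", DataType::Int32, true),
+        ]), true),
+    ]);
+    assert_eq!(schema.fields().len(), 4);
+}
+```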
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                  ┌──────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐   │┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │   ││  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │   ││  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤   │├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │   ││ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘   │└─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity  │ Values   ││ Validity   Values│ │
+│            │           │  │ │             │          ││                  │
+             │ "c.c1"    │                  │"d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │             │Primitive ││ PrimitiveArray   │
+             │ Array     │                  │Array     ││                  │ │
+│            └───────────┘  │ │             └──────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that aren’t groups. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example consider the case of d.d2, which contains two nullable levels d and d2.
+
+A definition level of 0 would imply a null at the level of d:
+
+```json
+{
+}
+```
+
+A definition level of 1 would imply a null at the level of d.d2
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of 2 would imply a defined value for d.d2:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the JSON documents above, this format could be stored in this Parquet schema:
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+Thus the parquet encoding of the example would be:
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+             ┌───────────┐                 ┌───────────┐┌──────────────────┐ │
+│   ┌─────┐  │ ┌─────┐   │  │ │   ┌─────┐  │ ┌─────┐   ││ ┌─────┐   ┌─────┐│
+    │  0  │  │ │ ??  │   │        │  1  │  │ │  1  │   ││ │  0  │   │ ??  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤  │ ├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  6  │   │        │  1  │  │ │  2  │   ││ │  1  │   │  1  ││ │
+│   ├─────┤  │ ├─────┤   │  │ │   ├─────┤  │ ├─────┤   ││ ├─────┤   ├─────┤│
+    │  1  │  │ │  7  │   │        │  0  │  │ │ ??  │   ││ │ ??  │   │ ??  ││ │
+│   └─────┘  │ └─────┘   │  │ │   └─────┘  │ └─────┘   ││ └─────┘   └─────┘│
+    Validity │  Values   │        Validity │  Values   ││ Validity   Values│ │
+│            │           │  │ │            │           ││                  │
+             │ "c.c1"    │                 │ "d.d1"    ││ "d.d2"           │ │
+│            │ Primitive │  │ │            │ Primitive ││ PrimitiveArray   │
+             │ Array     │                 │ Array     ││                  │ │
+│            └───────────┘  │ │            └───────────┘└──────────────────┘
+     "c"                           "d"                                       │
+│    StructArray            │ │    StructArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are columns containing a variable number of values. For example,
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],  <-- list elements of a are nullable
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+Documents of this format could be stored in this parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion with a list of monotonically increasing integers called *offsets* in the parent `ListArray`, and stores all the values that appear in the lists in a single child array. Each consecutive pair of elements in this offset array identifies a slice of the child array for that array index.
+
+For example, the list of offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and is therefore a ListArray of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: [child[2]]
+2: []
+```
+
+For the example above with 4 JSON documents, this would be encoded in arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
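+
+To make this concrete, here is a sketch using the Rust arrow crate (method names may differ slightly between versions) that builds the `ListArray` above and checks its offsets:
+
+```rust
+use arrow::array::{Array, ListArray};
+use arrow::datatypes::Int32Type;
+
+fn main() {
+    let a = ListArray::from_iter_primitive::<Int32Type, _, _>(vec![
+        Some(vec![Some(1)]),       // {"a": [1]}
+        None,                      // {}  -- "a" is null
+        Some(vec![]),              // {"a": []}
+        Some(vec![None, Some(2)]), // {"a": [null, 2]}
+    ]);
+
+    // Offsets [0, 1, 1, 1, 3]: rows 1 and 2 are zero-length slices of the child
+    assert_eq!(a.value_offsets(), &[0, 1, 1, 1, 3]);
+    assert_eq!(a.null_count(), 1);   // only the second document is null
+    assert_eq!(a.values().len(), 3); // child array: [1, null, 2]
+}
+```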
+
+### Repetition Levels
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of 0 would imply a new list in the top-most repeated field, a value of 1 a new element within the top-most repeated field, a value of 2 a new element within the second top-most repeated field, and so on.
+
+Each repeated field also has a corresponding definition level, however, in this case rather than indicating a null value, they indicate an empty array.
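+
+Applying these rules by hand to the leaf column `a.list.element` of the four documents gives the following (repetition, definition) pairs; this is a sketch of what a writer would produce, not output captured from the library:
+
+```rust
+fn main() {
+    //   {"a": [1]}       -> (rep 0, def 3)                 value is fully defined
+    //   {}               -> (rep 0, def 0)                 "a" itself is null
+    //   {"a": []}        -> (rep 0, def 1)                 "a" is defined but empty
+    //   {"a": [null, 2]} -> (rep 0, def 2), (rep 1, def 3)
+    let repetition_levels: [i16; 5] = [0, 0, 0, 0, 1];
+    let definition_levels: [i16; 5] = [3, 0, 1, 2, 3];
+    let values: [i32; 2] = [1, 2]; // only fully defined values are stored
+
+    // one repetition level of 0 per row ...
+    assert_eq!(repetition_levels.iter().filter(|&&r| r == 0).count(), 4);
+    // ... and one stored value per maximum definition level
+    assert_eq!(definition_levels.iter().filter(|&&d| d == 3).count(), values.len());
+}
+```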
+
+
+
+```text
+┌─────────────────────────────────────┐

Review Comment:
   @tustvold  -- can you review this one again? I added a new empty list to the example as you suggested (so now the example has 4 documents rather than 3). I am not sure about the repetition levels



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] tustvold commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990271820


##########
_posts/2022-10-07-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {       <-- b is always provided (not nullable)
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not nullable)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also
+  },               always provided (not nullable)
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in an Arrow `StructArray` with this schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
+│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
+   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
+│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
+   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
+│           │           │ │ │            │          ││                 │
+            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
+│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
+            │ Array     │                │Array     ││                 │ │
+│           └───────────┘ │ │            └──────────┘└─────────────────┘
+    "c"                         "d"                                      │
+│   StructArray           │ │   StructArray
+  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
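+
+As a rough sketch of how such an array is assembled with the Rust arrow crate (constructor details vary between versions), the non-nullable `b` column from the example can be built from its two child arrays:
+
+```rust
+use std::sync::Arc;
+use arrow::array::{Array, ArrayRef, Int32Array, StructArray};
+use arrow::datatypes::{DataType, Field};
+
+fn main() {
+    // "b.b1" is nullable, "b.b2" is not; "b" itself carries no validity mask
+    let b1: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(5)]));
+    let b2: ArrayRef = Arc::new(Int32Array::from(vec![3, 4, 6]));
+
+    let b = StructArray::from(vec![
+        (Field::new("b1", DataType::Int32, true), b1),
+        (Field::new("b2", DataType::Int32, false), b2),
+    ]);
+
+    assert_eq!(b.len(), 3);
+    assert_eq!(b.column(0).null_count(), 1); // the null in "b1" lives in the child array
+}
+```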
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d.d2`:
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the three JSON documents above, they could be stored in Parquet with this schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+The Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
+│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
+  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
+│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
+  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
+│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
+  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
+│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
+  │                    │ │   │                    │ │                  │ │
+│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
+  │    Levels          │ │   │    Levels          │ │   Levels         │ │
+│ │                    │   │ │                    │ │                  │
+  │  "c.1"             │ │   │  "d.1"             │ │  "d.d2"          │ │
+│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
+     "c"                 │      "d"                                      │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
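+
+The translation between the two representations is handled by the parquet crate's Arrow integration. The following sketch (API names as provided by the arrow and parquet crates at the time of writing; the file path is just an example) writes a `RecordBatch` containing these nested columns and reads it back, deriving definition levels from the validity masks on the way out and restoring them on the way in:
+
+```rust
+use std::fs::File;
+use arrow::record_batch::RecordBatch;
+use parquet::arrow::arrow_writer::ArrowWriter;
+use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
+
+fn roundtrip(batch: &RecordBatch) -> Result<RecordBatch, Box<dyn std::error::Error>> {
+    // Writing: nested validity masks are flattened into definition levels
+    let file = File::create("/tmp/nested.parquet")?;
+    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
+    writer.write(batch)?;
+    writer.close()?;
+
+    // Reading: definition levels are turned back into Arrow validity masks
+    let file = File::open("/tmp/nested.parquet")?;
+    let mut reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
+    Ok(reader.next().expect("expected one batch")?)
+}
+```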
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- "a" has a null and non-null elements
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion as a `ListArray`. A `ListArray` contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the ListArray
+
+For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: [child[2]]
+2: []
+```
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
+
+
+### Parquet Repetition Levels
+
+The example above with 4 JSON documents can be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` means a new list in the top-most repeated list, a value of `1` means a new element within the top-most repeated list, a value of `2` means a new element within the second top-most repeated list, and so on.
+
+*Protip*: for the topmost level list, the number of zeros in the `repetition` levels must match the number of rows.
+
+Each repeated field also has a corresponding definition level, however, in this case rather than indicating a null value, they indicate an empty array.
+
+
+```text
+┌─────────────────────────────────────┐
+│  ┌─────┐      ┌─────┐               │
+│  │  3  │      │  0  │               │
+│  ├─────┤      ├─────┤               │
+│  │  0  │      │  0  │               │
+│  ├─────┤      ├─────┤      ┌─────┐  │
+│  │  1  │      │  0  │      │  1  │  │
+│  ├─────┤      ├─────┤      ├─────┤  │
+│  │  2  │      │  0  │      │  2  │  │
+│  ├─────┤      ├─────┤      └─────┘  │
+│  │  3  │      │  1  │               │
+│  └─────┘      └─────┘               │

Review Comment:
   But there are 5 levels, as one of the arrays has two elements?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] alamb commented on pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #246:
URL: https://github.com/apache/arrow-site/pull/246#issuecomment-1272287832

   I am going to update the date and give this one more final review and then will get it published. Thanks again all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] alamb commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990458456


##########
_posts/2022-10-07-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {       <-- b is always provided (not nullable)
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not nullable)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also
+  },               always provided (not nullable)
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in an Arrow `StructArray` with this schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
+│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
+   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
+│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
+   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
+│           │           │ │ │            │          ││                 │
+            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
+│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
+            │ Array     │                │Array     ││                 │ │
+│           └───────────┘ │ │            └──────────┘└─────────────────┘
+    "c"                         "d"                                      │
+│   StructArray           │ │   StructArray
+  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d.d2`:
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
+
+
+Going back to the three JSON documents above, they could be stored in Parquet with this schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+The Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
+│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
+  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
+│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
+  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
+│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
+  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
+│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
+  │                    │ │   │                    │ │                  │ │
+│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
+  │    Levels          │ │   │    Levels          │ │   Levels         │ │
+│ │                    │   │ │                    │ │                  │
+  │  "c.1"             │ │   │  "d.1"             │ │  "d.d2"          │ │
+│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
+     "c"                 │      "d"                                      │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- "a" has a null and non-null elements
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion as a `ListArray`. A `ListArray` contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the ListArray
+
+For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: [child[2]]
+2: []
+```
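+
+The slicing rule itself needs no arrow APIs to demonstrate; the child values below are hypothetical placeholders:
+
+```rust
+fn main() {
+    let offsets = [0usize, 2, 3, 3];
+    let child = [10, 20, 30]; // stand-in child values
+
+    // Each consecutive pair (start, end) selects child[start..end]
+    for pair in offsets.windows(2) {
+        println!("{:?}", &child[pair[0]..pair[1]]); // [10, 20], [30], []
+    }
+}
+```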
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
+
+
+### Parquet Repetition Levels
+
+The example above with 4 JSON documents can be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` means a new list in the top-most repeated list, a value of `1` means a new element within the top-most repeated list, a value of `2` means a new element within the second top-most repeated list, and so on.
+
+*Protip*: for the topmost level list, the number of zeros in the `repetition` levels must match the number of rows.
+
+Each repeated field also has a corresponding definition level, however, in this case rather than indicating a null value, they indicate an empty array.
+
+
+```text
+┌─────────────────────────────────────┐
+│  ┌─────┐      ┌─────┐               │
+│  │  3  │      │  0  │               │
+│  ├─────┤      ├─────┤               │
+│  │  0  │      │  0  │               │
+│  ├─────┤      ├─────┤      ┌─────┐  │
+│  │  1  │      │  0  │      │  1  │  │
+│  ├─────┤      ├─────┤      ├─────┤  │
+│  │  2  │      │  0  │      │  2  │  │
+│  ├─────┤      ├─────┤      └─────┘  │
+│  │  3  │      │  1  │               │
+│  └─────┘      └─────┘               │

Review Comment:
   Awesome -- I will study it more carefully; Good thing there is a blog about this stuff I can read 😆 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] tustvold commented on a diff in pull request #246: ARROW-17909: [Website] Arrow and Parquet Part 2: Nested and Hierarchal Data using Structs and Lists

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #246:
URL: https://github.com/apache/arrow-site/pull/246#discussion_r990621001


##########
_posts/2022-10-07-arrow-parquet-encoding-part-2.md:
##########
@@ -0,0 +1,344 @@
+---
+layout: post
+title: "Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists"
+date: "2022-10-07 00:00:00"
+author: "tustvold and alamb"
+categories: [parquet, arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+This is the second in a three part series exploring how projects such as [Rust Apache Arrow](https://github.com/apache/arrow-rs) support conversion between [Apache Arrow](https://arrow.apache.org/) and [Apache Parquet](https://parquet.apache.org/). The [first post](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/) covered the basics of data storage and validity encoding, and this post will cover the more complex `Struct` and `List` types.
+
+[Apache Arrow](https://arrow.apache.org/) is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. [Apache Parquet](https://parquet.apache.org/) is an open, column-oriented data file format designed for very efficient data encoding and retrieval.
+
+
+## Struct / Group Columns
+
+Both Parquet and Arrow have the concept of a *struct* column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.
+
+For example, consider the following three JSON documents
+
+```json
+{              <-- First record
+  "a": 1,      <-- the top level fields are a, b, c, and d
+  "b": {       <-- b is always provided (not nullable)
+    "b1": 1,   <-- b1 and b2 are "nested" fields of "b"
+    "b2": 3    <-- b2 is always provided (not nullable)
+   },
+ "d": {
+   "d1":  1    <-- d1 is a "nested" field of "d"
+  }
+}
+```
+```json
+{              <-- Second record
+  "a": 2,
+  "b": {
+    "b2": 4    <-- note "b1" is NULL in this record
+  },
+  "c": {       <-- note "c" was NULL in the first record
+    "c1": 6        but when "c" is provided, c1 is also
+  },               always provided (not nullable)
+  "d": {
+    "d1": 2,
+    "d2": 1
+  }
+}
+```
+```json
+{              <-- Third record
+  "b": {
+    "b1": 5,
+    "b2": 6
+  },
+  "c": {
+    "c1": 7
+  }
+}
+```
+Documents of this format could be stored in an Arrow `StructArray` with this schema
+
+```text
+Field(name: "a", nullable: true, datatype: Int32)
+Field(name: "b", nullable: false, datatype: Struct[
+  Field(name: "b1", nullable: true, datatype: Int32),
+  Field(name: "b2", nullable: false, datatype: Int32)
+])
+Field(name: "c"), nullable: true, datatype: Struct[
+  Field(name: "c1", nullable: false, datatype: Int32)
+])
+Field(name: "d"), nullable: true, datatype: Struct[
+  Field(name: "d1", nullable: false, datatype: Int32)
+  Field(name: "d2", nullable: true, datatype: Int32)
+])
+```
+
+
+Arrow represents each `StructArray` hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays
+
+```text
+  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
+  │                   │           ┌─────────────────┐ ┌────────────┐
+  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
+  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
+  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
+  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
+  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
+  │ Validity   Values │           │Validity   Values│ │   Values   │
+  │                   │        │  │                 │ │            │ │
+  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
+  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
+  └───────────────────┘           │                 │ │  Array     │
+                               │  └─────────────────┘ └────────────┘ │
+                                    "b"
+                               │    StructArray                      │
+                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
+│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
+   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
+│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
+   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
+│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
+   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
+│           │           │ │ │            │          ││                 │
+            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
+│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
+            │ Array     │                │Array     ││                 │ │
+│           └───────────┘ │ │            └──────────┘└─────────────────┘
+    "c"                         "d"                                      │
+│   StructArray           │ │   StructArray
+  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+ ```
+
+More technical detail is available in the [StructArray format specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout).
+
+### Definition Levels
+Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element is the depth in the schema at which it is fully defined.
+
+For example consider the case of `d.d2`, which contains two nullable levels `d` and `d2`.
+
+A definition level of `0` would imply a null at the level of `d`:
+
+```json
+{
+}
+```
+
+A definition level of `1` would imply a null at the level of `d.d2`:
+
+```json
+{
+  d: { .. }
+}
+```
+
+A definition level of `2` would imply a defined value for `d.d2`:
+
+```json
+{
+  d: { d2: .. }
+}
+```
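+
+Applying this rule by hand to the `d.d2` column of the three example documents gives the levels shown in the Parquet encoding figure further down (a sketch of the rule, not output from the library):
+
+```rust
+fn main() {
+    //   {"d": {"d1": 1}}          -> 1  ("d" is defined, "d2" is null)
+    //   {"d": {"d1": 2, "d2": 1}} -> 2  ("d.d2" is fully defined)
+    //   {}                        -> 0  ("d" itself is null)
+    let d_d2_definition_levels: [i16; 3] = [1, 2, 0];
+    let d_d2_values: [i32; 1] = [1]; // only the fully defined value is stored
+
+    // one stored value per maximum definition level
+    assert_eq!(
+        d_d2_definition_levels.iter().filter(|&&l| l == 2).count(),
+        d_d2_values.len()
+    );
+}
+```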
+
+
+Going back to the three JSON documents above, they could be stored in Parquet with this schema
+
+```text
+message schema {
+  optional int32 a;
+  required group b {
+    optional int32 b1;
+    required int32 b2;
+  }
+  optional group c {
+    required int32 c1;
+  }
+  optional group d {
+    required int32 d1;
+    optional int32 d2;
+  }
+}
+```
+
+The Parquet encoding of the example would be:
+
+```text
+ ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
+ │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
+ │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
+ │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
+ │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
+ │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
+ │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
+ │                        │  │ │  └─────┘             │ │  └─────┘  │
+ │  Definition    Data    │    │                      │ │           │ │
+ │    Levels              │  │ │  Definition   Data   │ │   Data    │
+ │                        │    │    Levels            │ │           │ │
+ │  "a"                   │  │ │                      │ │           │
+ └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
+                             │ └──────────────────────┘ └───────────┘
+                                  "b"                                 │
+                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+
+
+┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
+│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
+  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
+│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
+  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
+│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
+  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
+│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
+  │                    │ │   │                    │ │                  │ │
+│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
+  │    Levels          │ │   │    Levels          │ │   Levels         │ │
+│ │                    │   │ │                    │ │                  │
+  │  "c.1"             │ │   │  "d.1"             │ │  "d.d2"          │ │
+│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
+     "c"                 │      "d"                                      │
+└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+ ```
+
+## List / Repeated Columns
+
+Closing out support for nested types are *lists*, which contain a variable number of other values. For example, the following four documents each have a (nullable) field `a` containing a list of integers
+
+```json
+{                     <-- First record
+  "a": [1],           <-- top-level field a containing list of integers
+}
+```
+```json
+{                     <-- "a" is not provided (is null)
+}
+```
+```json
+{                     <-- "a" is non-null but empty
+  "a": []
+}
+```
+```json
+{
+  "a": [null, 2],     <-- "a" has a null and non-null elements
+}
+```
+
+Documents of this format could be stored in this Arrow schema
+
+```text
+Field(name: "a", nullable: true, datatype: List(
+  Field(name: "element", nullable: true, datatype: Int32),
+))
+```
+
+As before, Arrow chooses to represent this in a hierarchical fashion as a `ListArray`. A `ListArray` contains a list of monotonically increasing integers called *offsets*, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the ListArray
+
+For example, a list with offsets `[0, 2, 3, 3]` contains 3 pairs of offsets, `(0,2)`, `(2,3)`, and `(3,3)`, and therefore represents a `ListArray` of length 3 with the following values:
+
+```text
+0: [child[0], child[1]]
+1: [child[2]]
+2: []
+```
+
+For the example above with 4 JSON documents, this would be encoded in Arrow as
+
+
+```text
+┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
+                          ┌──────────────────┐ │
+│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
+     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
+│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
+     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
+│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
+     │  1  │   │  1  │    │ Validity   Values│ │
+│    └─────┘   ├─────┤    │                  │
+               │  3  │    │ child[0]         │ │
+│    Validity  └─────┘    │ PrimitiveArray   │
+                          │                  │ │
+│              Offsets    └──────────────────┘
+     "a"                                       │
+│    ListArray
+ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
+```
+
+More technical detail is available in the [ListArray format specification](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
+
+
+### Parquet Repetition Levels
+
+The example above with 4 JSON documents can be stored in this Parquet schema
+
+```text
+message schema {
+  optional group a (LIST) {
+    repeated group list {
+      optional int32 element;
+    }
+  }
+}
+```
+
+In order to encode lists, Parquet stores an integer *repetition level* in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of `0` means a new list in the top-most repeated list, a value of `1` means a new element within the top-most repeated list, a value of `2` means a new element within the second top-most repeated list, and so on.
+
+*Protip*: for the topmost level list, the number of zeros in the `repetition` levels must match the number of rows.

Review Comment:
   ```suggestion
   *Protip*: the number of zeros in the `repetition` levels must match the number of rows.
   ```
   The repetition levels don't belong to a list, they belong to the leaf column



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org