Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/06/11 10:57:01 UTC

[GitHub] [arrow] tustvold commented on a diff in pull request #36027: GH-36028: [Documentation] Detailed parquet format support and parquet integration status

tustvold commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1225772849


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |

Review Comment:
   What does "Page statistics" refer to here?



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |

Review Comment:
   I'm not sure I'd consider this a feature of the Parquet implementation; IMO it is more a detail of the query engine.



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |

Review Comment:
   What is "bloom filter length"?



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |

Review Comment:
   I think it would be clearer if you listed the actual encodings.
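
   As a rough sketch of what I mean, assuming the intent is the three delta encodings defined in the Parquet format specification (DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY), the single row could become one row per encoding. The support cells are left blank here, as in the rest of the table:

   ```rst
   +-------------------------------------------+-------+--------+--------+-------+-------+
   | DELTA_BINARY_PACKED                       |       |        |        |       |       |
   +-------------------------------------------+-------+--------+--------+-------+-------+
   | DELTA_LENGTH_BYTE_ARRAY                   |       |        |        |       |       |
   +-------------------------------------------+-------+--------+--------+-------+-------+
   | DELTA_BYTE_ARRAY                          |       |        |        |       |       |
   +-------------------------------------------+-------+--------+--------+-------+-------+
   ```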



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |

Review Comment:
   I wonder if we could have separate tables for supported physical types, encodings, and compression.



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |

Review Comment:
   ```suggestion
   | Column pruning using projection pushdown  |       |        |        |       |       |
   ```



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition append / delete                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup append / delete                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page append / delete                      |       |        |        |       |       |

Review Comment:
   I don't think any implementation supports page appending; the Rust implementation supports appending column chunks, though.



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition append / delete                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup append / delete                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page append / delete                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page CRC32 checksum                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel partition processing             |       |        |        |       |       |

Review Comment:
   IMO this is a detail of the query engine, not of the file format.



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |

Review Comment:
   I'm not sure what this is or how it differs from ColumnChunk.



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition append / delete                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup append / delete                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page append / delete                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page CRC32 checksum                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel partition processing             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel RowGroup processing              |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel Page processing                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Storage-aware defaults (1)                |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Adaptive concurrency (2)                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Adaptive IO when pruning used (3)         |       |        |        |       |       |

Review Comment:
   I'm not sure which Parquet reader these features are based on, but my 2 cents is that they indicate a problematic IO abstraction: one that relies on prefetching heuristics instead of pushing vectored IO down into the IO subsystem (which is what the Rust implementation does).
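
   To illustrate what pushing vectored IO down can look like, here is a minimal sketch, not the Rust implementation's code. It assumes the reader hands the IO layer every byte range it needs up front; `coalesce_ranges`, `read_ranges` and `max_gap` are hypothetical names, and an in-memory `bytes` object stands in for the file or object store.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (start, end) byte offsets, end exclusive


def coalesce_ranges(ranges: List[Range], max_gap: int = 0) -> List[Range]:
    """Merge ranges that overlap or lie within max_gap bytes of each other."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate read.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def read_ranges(data: bytes, ranges: List[Range], max_gap: int = 0) -> List[bytes]:
    """Serve every requested range from as few underlying reads as possible."""
    # One slice per coalesced range stands in for one IO request.
    fetched = [(s, e, data[s:e]) for s, e in coalesce_ranges(ranges, max_gap)]
    out: List[bytes] = []
    for start, end in ranges:
        for s, e, buf in fetched:
            if s <= start and end <= e:
                out.append(buf[start - s:end - s])
                break
    return out
```

   With this shape the IO layer sees the full request and decides how to batch it, so the reader does not need storage-specific prefetching defaults.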



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |

Review Comment:
   Again, this is a detail of the query engine, not of the Parquet implementation, IMO.



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+

Review Comment:
   You can't not support this metadata, as otherwise the parquet file can't be read?



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |

Review Comment:
   I don't think this is supported by the format; bloom filters are per column chunk, not per page.
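
   A toy model of why that granularity matters, under stated assumptions: `ColumnChunk` and `row_groups_to_scan` are hypothetical names, and an exact Python set stands in for a real bloom filter (so it has no false positives). One filter per column chunk can rule out a whole row group, but it says nothing about which page inside that chunk holds the value.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class ColumnChunk:
    # Stand-in for the chunk's bloom filter: one membership structure
    # covering every page in the chunk.
    filter_values: Set[int]
    # Values grouped by data page; the filter cannot tell these apart.
    pages: List[List[int]]


def row_groups_to_scan(chunks: List[ColumnChunk], needle: int) -> List[int]:
    """Return indices of row groups whose chunk filter may contain needle."""
    return [i for i, chunk in enumerate(chunks) if needle in chunk.filter_values]
```

   Skipping individual pages would need per-page information, which the bloom filter does not carry.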


