You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alippai (via GitHub)" <gi...@apache.org> on 2023/06/11 02:32:53 UTC

[GitHub] [arrow] alippai opened a new pull request, #36027: Detailed parquet and parquet integration support status

alippai opened a new pull request, #36027:
URL: https://github.com/apache/arrow/pull/36027

   This is a draft skeleton for: https://github.com/apache/arrow/issues/35638#issuecomment-1584966456


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alippai commented on pull request #36027: GH-36028: [Documentation] Detailed parquet format support and parquet integration status

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai commented on PR #36027:
URL: https://github.com/apache/arrow/pull/36027#issuecomment-1585980413

   I'm sure this is too detailed in some places also there is a good chance that it misses many useful features.
   
   My approach was going through the [parquet-format changelog](https://github.com/apache/parquet-format/blob/master/CHANGES.md), the thrift file, the parquet-mr, arrow and arrow-rs issue queue. 
   
   I've intentionally tried to avoid 2.4-2.10 parquet format version info as it'd imply that the 2.9 features include 2.6 features which might not reflect the reality. Instead of that I've tried to focus on the end-user public API and providing a flat list of features instead. I'm open for different approaches as well. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on a diff in pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1226796853


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |

Review Comment:
   Isn't this also a detail of the engine choosing what columns to read or not? Or is the intent here to indicate that rows/values can be pruned based on projection directly in the parquet lib?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tustvold commented on a diff in pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1226373960


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |

Review Comment:
   OMG, they finally added it - amazing, will get that incorporated into the rust writer/reader



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on a diff in pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1226793889


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+

Review Comment:
   Is the intention to indicate that the metadata is available through a public API rather than saying whether or not it is supported in general, since as @tustvold says, you have to support the metadata otherwise the file can't be read.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #36027:
URL: https://github.com/apache/arrow/pull/36027#issuecomment-1591730219

   I'll repeat what the rest said about engine/format differences and maybe offer some clarification.
   
   In C++ the picture is pretty clear, as the APIs tend to be focused on implementation:
   
   There is a C++ parquet module which is purely a parquet reader.
   There is a C++ datasets library which, using Acero, offers a lot of features on top of this
   
   In pyarrow the picture is pretty muddled, as the APIs are more focused on user experience:
   
   There is a pyarrow.parquet module, however, many of its features are powered by C++ datasets.  For example, the pyarrow.parquet module can read from S3 even the the C++ parquet module has no concept of S3 (it just has an abstraction for input streams).
   
   So I agree with the others that we should probably not base the features on the python API.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #36027:
URL: https://github.com/apache/arrow/pull/36027#issuecomment-1593161298

   Also, do we think this table might belong at https://parquet.apache.org/docs/ (and we could link to it from Arrow's docs)?  For example, the parquet-mr (java) implementation and the parquet.net (C#) implementation are not involved with the arrow project but are still standalone parquet readers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tustvold commented on a diff in pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1226372851


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition append / delete                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup append / delete                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page append / delete                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page CRC32 checksum                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel partition processing             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel RowGroup processing              |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel Page processing                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Storage-aware defaults (1)                |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Adaptive concurrency (2)                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Adaptive IO when pruning used (3)         |       |        |        |       |       |

Review Comment:
   Perhaps just a "Vectorized IO Pushdown". I believe there are efforts to add such an API to parquet-mr



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alippai commented on pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai commented on PR #36027:
URL: https://github.com/apache/arrow/pull/36027#issuecomment-1586551865

   Thanks @tustvold. I'll address the Page vs ColumnChunk issues and other improvement ideas. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tustvold commented on a diff in pull request #36027: GH-36028: [Documentation] Detailed parquet format support and parquet integration status

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1225775542


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition append / delete                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup append / delete                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page append / delete                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page CRC32 checksum                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel partition processing             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel RowGroup processing              |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel Page processing                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Storage-aware defaults (1)                |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Adaptive concurrency (2)                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Adaptive IO when pruning used (3)         |       |        |        |       |       |

Review Comment:
   I'm not sure which parquet reader these features are based off, but my 2 cents is that they indicate a problematic IO abstraction that relies on prefetching heuristics instead of pushing vectored IO down into the IO subsystem (which the Rust, and proprietary DataBricks implementation does).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tustvold commented on a diff in pull request #36027: GH-36028: [Documentation] Detailed parquet format support and parquet integration status

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1225775230


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition append / delete                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup append / delete                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page append / delete                      |       |        |        |       |       |

Review Comment:
   I don't think any support page appending, the semantics would be peculiar for things like dictionary pages, the rust implementation does support appending column chunks though



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tustvold commented on a diff in pull request #36027: GH-36028: [Documentation] Detailed parquet format support and parquet integration status

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1225775542


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition append / delete                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup append / delete                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page append / delete                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page CRC32 checksum                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel partition processing             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel RowGroup processing              |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel Page processing                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Storage-aware defaults (1)                |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Adaptive concurrency (2)                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Adaptive IO when pruning used (3)         |       |        |        |       |       |

Review Comment:
   I'm not sure which parquet reader these features are based off, but my 2 cents is that they indicate a problematic IO abstraction that relies on prefetching heuristics instead of pushing vectored IO down into the IO subsystem (which the Rust, and proprietary DataBricks implementation do).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alippai commented on a diff in pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1226061939


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |

Review Comment:
   Same, it's part of the current API, but I agree it's not consistent across implementations.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alippai commented on a diff in pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1226061566


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |

Review Comment:
   https://github.com/apache/parquet-format/blob/c766945d90935ebcd4e03fee13aad2b6efcadce3/src/main/thrift/parquet.thrift#L766



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alippai commented on a diff in pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1226060456


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |

Review Comment:
   While arrow-rs needs datafusion for this functionality, arrow handles it without Acero. I don't have strong opinion though



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alippai commented on pull request #36027: GH-36028: [Documentation] Detailed parquet format support and parquet integration status

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai commented on PR #36027:
URL: https://github.com/apache/arrow/pull/36027#issuecomment-1585980652

   @tustvold @mapleFU @westonpace @wgtmac What do you think? Would this be useful?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alippai commented on a diff in pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1226064118


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition append / delete                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup append / delete                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page append / delete                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page CRC32 checksum                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel partition processing             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel RowGroup processing              |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel Page processing                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Storage-aware defaults (1)                |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Adaptive concurrency (2)                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Adaptive IO when pruning used (3)         |       |        |        |       |       |

Review Comment:
   I wanted to capture the IO pushdown section https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/#io-pushdown but also added more. Likely out of scope as none of the implementations goes into details or provides an API



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on a diff in pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1226798770


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition append / delete                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup append / delete                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page append / delete                      |       |        |        |       |       |

Review Comment:
   Isn't Parquet itself a *write-once* format that can't be appended to? I'm not sure what these are supposed to indicate. The inability to append/delete without re-writing a Parquet file is why table formats like Iceberg and Delta have proliferated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alippai commented on pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai commented on PR #36027:
URL: https://github.com/apache/arrow/pull/36027#issuecomment-1598019587

   Moved it to the parquet-site repo: https://github.com/apache/parquet-site/pull/34


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alippai commented on a diff in pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1226061227


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |

Review Comment:
   Like I said there is a good chance I made a mistake here. I saw this in the thrift spec: ColumnChunk->ColumnMetadata->Statistics



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alippai commented on pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai commented on PR #36027:
URL: https://github.com/apache/arrow/pull/36027#issuecomment-1593201839

   Thanks, I can do another round on the weekend on the correct website and the suggestions included 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #36027: Detailed parquet and parquet integration support status

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36027:
URL: https://github.com/apache/arrow/pull/36027#issuecomment-1585979262

   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #36027: GH-36028: [Documentation] Detailed parquet format support and parquet integration status

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36027:
URL: https://github.com/apache/arrow/pull/36027#issuecomment-1585980309

   :warning: GitHub issue #36028 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tustvold commented on a diff in pull request #36027: GH-36028: [Documentation] Detailed parquet format support and parquet integration status

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1225773860


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |

Review Comment:
   I think it would be clearer if you listed the actual encodings, perhaps in a separate table



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #36027:
URL: https://github.com/apache/arrow/pull/36027#issuecomment-1591733456

   Although...to play devil's advocate...it might be odd when a feature is available in the parquet reader, but not yet exposed in the query component.  For example, there is some row skipping and bloom filters in the C++ parquet reader, but we haven't integrated those into the datasets layer yet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alippai closed pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai closed pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status
URL: https://github.com/apache/arrow/pull/36027


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alippai commented on a diff in pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1226062318


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition append / delete                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup append / delete                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page append / delete                      |       |        |        |       |       |

Review Comment:
   Yes, likely some / most of the Page references should be ColumnChunk. I'll read about this more.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alippai commented on a diff in pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1226061443


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition append / delete                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup append / delete                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page append / delete                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page CRC32 checksum                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel partition processing             |       |        |        |       |       |

Review Comment:
   It's part of the arrow API in python



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1226080301


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |

Review Comment:
   Could we organize these items in a layered fashion? Maybe this is a good start point: https://arrow.apache.org/docs/cpp/parquet.html#supported-parquet-features



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+

Review Comment:
   Are these intended for the completeness of fields defined in the metadata? If yes, probably they worth a separate table and indicate the states of each field. But that sounds too complicated.



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |

Review Comment:
   I agree with @tustvold, `partitioning` is more like a high-level use case on top of file format.



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |

Review Comment:
   The `Java` column could be misleading here. In the arrow repo, there is a java dataset reader to support reading from parquet dataset. If this is for parquet-mr, then it can be easily out of sync.



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |

Review Comment:
   +1 for this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tustvold commented on a diff in pull request #36027: GH-36028: [Documentation] Detailed parquet format support and parquet integration status

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1225772849


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |

Review Comment:
   What is this referring to? 



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |

Review Comment:
   I'm not sure I'd consider this a feature of the parquet implementation, it is more a detail of the query engine imo?



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |

Review Comment:
   What is this?



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |

Review Comment:
   I think it would be clearer if you listed the actual encodings



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |

Review Comment:
   I wonder if we could have separate tables for supported physical types, encodings and compression



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |

Review Comment:
   ```suggestion
   | Column Pruning using projection pushdown    |       |        |        |       |       |
   ```



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition append / delete                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup append / delete                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page append / delete                      |       |        |        |       |       |

Review Comment:
   I don't think any support page appending, the rust implementation supports appending column chunks though



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition append / delete                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup append / delete                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page append / delete                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page CRC32 checksum                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel partition processing             |       |        |        |       |       |

Review Comment:
   IMO this is a query engine detail, not a detail of the file format?



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |

Review Comment:
   I'm not sure what this is and how it differs from ColumnChunk



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition append / delete                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup append / delete                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page append / delete                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page CRC32 checksum                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel partition processing             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel RowGroup processing              |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Parallel Page processing                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Storage-aware defaults (1)                |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Adaptive concurrency (2)                  |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Adaptive IO when pruning used (3)         |       |        |        |       |       |

Review Comment:
   I'm not sure which parquet reader these features are based off, but my 2 cents is that they indicate a problematic IO abstraction that relies on prefetching heuristics instead of pushing vectored IO down into the IO subsystem (which the Rust implementation does).



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |

Review Comment:
   Again this is a detail of the query engine not the parquet implementation imo



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+

Review Comment:
   You can't not support this metadata, as otherwise the parquet file can't be read?



##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Modular encryption                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| External column data                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Nanosecond support                        |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| FIXED_LEN_BYTE_ARRAY                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete Delta encoding support           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Complete RLE support                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| BYTE_STREAM_SPLIT                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Partition pruning on the partition column |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using statistics         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup pruning using bloom filter       |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using projection pushdown    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using statistics             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page pruning using bloom filter           |       |        |        |       |       |

Review Comment:
   I don't think this is supported by the format, bloom filters are per column chunk



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #36027:
URL: https://github.com/apache/arrow/pull/36027#discussion_r1226380305


##########
docs/source/status.rst:
##########
@@ -348,3 +348,107 @@ Notes:
 * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``)
 
 * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``)
+
+
+Parquet format public API details
+=================================
+
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Format                                    | C++   | Python | Java   | Go    | Rust  |
+|                                           |       |        |        |       |       |
++===========================================+=======+========+========+=======+=======+
+| Basic compression                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Brotli, LZ4, ZSTD                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| LZ4_RAW                                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Hive-style partitioning                   |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| File metadata                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| RowGroup metadata                         |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Column metadata                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Chunk metadta                             |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Sorting column                            |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| ColumnIndex statistics                    |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Page statistics                           |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| Statistics min_value                      |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| xxHash based bloom filter                 |       |        |        |       |       |
++-------------------------------------------+-------+--------+--------+-------+-------+
+| bloom filter length                       |       |        |        |       |       |

Review Comment:
   > OMG, they finally added it - amazing, will get that incorporated into the rust writer/reader
   
   I just added it recently :) Please note that the latest format is not released yet so the parquet-mr does not know `bloom_filter_length` now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #36027: GH-36028: [Docs][Parquet] Detailed parquet format support and parquet integration status

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36027:
URL: https://github.com/apache/arrow/pull/36027#issuecomment-1593178321

   Agreed with @westonpace.
   I created https://issues.apache.org/jira/browse/PARQUET-2310 to propose adding this in the Parquet docs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org