Posted to github@arrow.apache.org by "mapleFU (via GitHub)" <gi...@apache.org> on 2023/06/25 11:36:04 UTC

[GitHub] [arrow] mapleFU opened a new pull request, #36290: GH-36284: [Python][Parquet] Support write page index in Python API

mapleFU opened a new pull request, #36290:
URL: https://github.com/apache/arrow/pull/36290

   
   ### Rationale for this change
   
   
   Support `write_page_index` in the Parquet Python API.
   
   ### What changes are included in this PR?
   
   
   Support `write_page_index` in the writer properties.
   
   ### Are these changes tested?
   
   
   Not yet.
   
   ### Are there any user-facing changes?
   
   Users can now write a page index from the Python API.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1254402322


##########
python/pyarrow/_parquet.pyx:
##########
@@ -1599,6 +1610,14 @@ cdef shared_ptr[WriterProperties] _create_writer_properties(
     # a size larger than this then it will be latched to this value.
     props.max_row_group_length(_MAX_ROW_GROUP_SIZE)
 
+    # page index
+
+    if isinstance(write_page_index, bool):

Review Comment:
   Why ignore the value if it's not boolean? This makes the API confusing.
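
The concern above can be illustrated with a minimal sketch. This is not the code from the PR: both helper names are hypothetical, and the sketch only contrasts silently ignoring a non-boolean value with rejecting it explicitly.

```python
def ignore_non_bool(value, default=False):
    """Hypothetical helper: mimics the questioned pattern.

    A non-bool argument is silently dropped and the default is used,
    so a caller passing "yes" gets no feedback at all.
    """
    return value if isinstance(value, bool) else default


def require_bool(value, name="write_page_index"):
    """Hypothetical helper: reject anything that is not a bool."""
    if not isinstance(value, bool):
        raise TypeError(
            f"{name} must be a bool, got {type(value).__name__}"
        )
    return value


# The silent variant hides the caller's mistake.
print(ignore_non_bool("yes"))

# The strict variant surfaces it immediately.
try:
    require_bool("yes")
except TypeError as exc:
    print(exc)
```

Raising on unexpected input keeps the keyword's behavior predictable: either the option is applied as given, or the caller learns that the argument was invalid.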



##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -357,6 +357,20 @@ def test_field_id_metadata():
     assert schema[5].metadata[field_id] == b'-1000'
 
 
+def test_parquet_file_page_index():
+    table = pa.table({'a': [1, 2, 3]})
+
+    writer = pa.BufferOutputStream()
+    _write_table(table, writer, write_page_index=True)
+    reader = pa.BufferReader(writer.getvalue())
+
+    # Can retrieve the page index flags from the column chunk metadata
+    metadata = pq.read_metadata(reader)
+    cc = metadata.row_group(0).column(0)
+    assert cc.has_offset_index is True
+    assert cc.has_column_index is True

Review Comment:
   Ok, but can we also have a test where these properties are false?





[GitHub] [arrow] mapleFU commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1254486377


##########
python/pyarrow/_parquet.pyx:
##########
@@ -1599,6 +1610,14 @@ cdef shared_ptr[WriterProperties] _create_writer_properties(
     # a size larger than this then it will be latched to this value.
     props.max_row_group_length(_MAX_ROW_GROUP_SIZE)
 
+    # page index
+
+    if isinstance(write_page_index, bool):

Review Comment:
   I haven't written Python in a long time, so I just followed the code style above. I'll change it.





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1253014060


##########
python/pyarrow/_parquet.pyx:
##########
@@ -493,6 +493,16 @@ cdef class ColumnChunkMetaData(_Weakrefable):
         """Uncompressed size in bytes (int)."""
         return self.metadata.total_uncompressed_size()
 
+    @property
+    def has_offset_index(self):
+        """Has offset index"""

Review Comment:
   An actual question for this PR: we already have a `has_index_page` attribute. How exactly does that differ?
   
   





[GitHub] [arrow] mapleFU commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1253068465


##########
python/pyarrow/parquet/core.py:
##########
@@ -867,6 +867,10 @@ def _sanitize_table(table, new_schema, flavor):
     it will restore the timezone (Parquet only stores the UTC values without
     timezone), or columns with duration type will be restored from the int64
     Parquet column.
+write_page_index : bool, default False

Review Comment:
   I guess most of the time there is no performance issue. But when a user has extremely long strings, we might write too much data.





[GitHub] [arrow] mapleFU commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1254727262


##########
python/pyarrow/_parquet.pyx:
##########
@@ -1599,6 +1610,14 @@ cdef shared_ptr[WriterProperties] _create_writer_properties(
     # a size larger than this then it will be latched to this value.
     props.max_row_group_length(_MAX_ROW_GROUP_SIZE)
 
+    # page index
+
+    if isinstance(write_page_index, bool):

Review Comment:
   Sorry...maybe I forgot to pop the stash





[GitHub] [arrow] pitrou commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1629207496

   @github-actions crossbow submit -g python




[GitHub] [arrow] mapleFU commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1606104562

   @jorisvandenbossche Would you mind taking a look? I'm not very familiar with the Python part, so I may have gotten something wrong.




[GitHub] [arrow] pitrou commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1621633045

   @github-actions crossbow submit -g python




[GitHub] [arrow] mapleFU commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1253072174


##########
python/pyarrow/parquet/core.py:
##########
@@ -867,6 +867,10 @@ def _sanitize_table(table, new_schema, flavor):
     it will restore the timezone (Parquet only stores the UTC values without
     timezone), or columns with duration type will be restored from the int64
     Parquet column.
+write_page_index : bool, default False

Review Comment:
   Sure. Currently it will "discard" statistics that are too long, and discard the page index as well. I will implement truncation in the future.





[GitHub] [arrow] pitrou commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1253069386


##########
python/pyarrow/parquet/core.py:
##########
@@ -867,6 +867,10 @@ def _sanitize_table(table, new_schema, flavor):
     it will restore the timezone (Parquet only stores the UTC values without
     timezone), or columns with duration type will be restored from the int64
     Parquet column.
+write_page_index : bool, default False

Review Comment:
   We are allowed to trim the min/max values, right?
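
One way such trimming could work for byte-string statistics is sketched below. This is an illustration under stated assumptions, not the Arrow implementation: it assumes unsigned byte-wise lexicographic ordering, and the names `truncate_min` and `truncate_max` are hypothetical. A trimmed minimum only has to stay less than or equal to the true minimum, and a trimmed maximum only has to stay greater than or equal to the true maximum.

```python
def truncate_min(value: bytes, max_len: int) -> bytes:
    """Return a lower bound for `value` of at most `max_len` bytes.

    A prefix always sorts at or before the full string.
    """
    return value[:max_len]


def truncate_max(value: bytes, max_len: int):
    """Return an upper bound for `value` of at most `max_len` bytes.

    Returns None when no shorter upper bound exists (the prefix is
    made up entirely of 0xFF bytes), in which case the caller would
    have to keep the full value or drop the statistic.
    """
    if len(value) <= max_len:
        return value
    prefix = bytearray(value[:max_len])
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()
    if not prefix:
        return None
    # Bumping the last byte makes the prefix sort after the original.
    prefix[-1] += 1
    return bytes(prefix)


full_min = b"apple pie with extra cinnamon"
full_max = b"zucchini bread with walnuts"
lo = truncate_min(full_min, 4)
hi = truncate_max(full_max, 4)
print(lo, hi)
print(lo <= full_min, hi >= full_max)
```

The bounds become looser after trimming, so a reader may scan a few extra pages, but it never skips a page that actually contains matching values.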





[GitHub] [arrow] mapleFU commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1253059920


##########
python/pyarrow/parquet/core.py:
##########
@@ -867,6 +867,10 @@ def _sanitize_table(table, new_schema, flavor):
     it will restore the timezone (Parquet only stores the UTC values without
     timezone), or columns with duration type will be restored from the int64
     Parquet column.
+write_page_index : bool, default False

Review Comment:
   Currently not. I found it hard to implement page index pruning in the current implementation. If we implement it, maybe we can enable this by default.





[GitHub] [arrow] conbench-apache-arrow[bot] commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1646225419

   After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 12f45ba393e076461648781145ac65896bdcdf2f.
   
   There were no benchmark performance regressions. 🎉
   
   The [full Conbench report](https://github.com/apache/arrow/runs/15247427485) has more details.




[GitHub] [arrow] mapleFU commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1617442265

   @pitrou @westonpace Would you mind taking a look? This patch adds support for writing the `page_index` from Python.




[GitHub] [arrow] jorisvandenbossche commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1621611304

   > Hypothesis failure is a new one but I do not see how it could be related to this PR.
   
   Hmm, that seems very similar to the one that I fixed last week (https://github.com/apache/arrow/issues/36349, but now with another unknown timezone). In any case, you can ignore it here.




[GitHub] [arrow] mapleFU commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1253072174


##########
python/pyarrow/parquet/core.py:
##########
@@ -867,6 +867,10 @@ def _sanitize_table(table, new_schema, flavor):
     it will restore the timezone (Parquet only stores the UTC values without
     timezone), or columns with duration type will be restored from the int64
     Parquet column.
+write_page_index : bool, default False

Review Comment:
   Sure. Here it will "truncate" it, and discard the page index. I will implement truncation in the future.





[GitHub] [arrow] github-actions[bot] commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1629212205

   Revision: 9840291d1e8d2a826035e4cac40490fdbffe47e1
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-f780c64692](https://github.com/ursacomputing/crossbow/branches/all?query=actions-f780c64692)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.10|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.10)](https://github.com/ursacomputing/crossbow/actions/runs/5510057993/jobs/10043692666)|
   |test-conda-python-3.10-hdfs-2.9.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.10-hdfs-2.9.2)](https://github.com/ursacomputing/crossbow/actions/runs/5510060691/jobs/10043697451)|
   |test-conda-python-3.10-hdfs-3.2.1|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.10-hdfs-3.2.1)](https://github.com/ursacomputing/crossbow/actions/runs/5510058506/jobs/10043693548)|
   |test-conda-python-3.10-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.10-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5510063634/jobs/10043703173)|
   |test-conda-python-3.10-pandas-nightly|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.10-pandas-nightly)](https://github.com/ursacomputing/crossbow/actions/runs/5510057023/jobs/10043691030)|
   |test-conda-python-3.10-spark-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.10-spark-master)](https://github.com/ursacomputing/crossbow/actions/runs/5510062552/jobs/10043701415)|
   |test-conda-python-3.10-substrait|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.10-substrait)](https://github.com/ursacomputing/crossbow/actions/runs/5510055033/jobs/10043687516)|
   |test-conda-python-3.11|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.11)](https://github.com/ursacomputing/crossbow/actions/runs/5510057465/jobs/10043691624)|
   |test-conda-python-3.11-dask-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.11-dask-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5510061417/jobs/10043699038)|
   |test-conda-python-3.11-dask-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.11-dask-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/5510065556/jobs/10043707243)|
   |test-conda-python-3.11-hypothesis|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.11-hypothesis)](https://github.com/ursacomputing/crossbow/actions/runs/5510062165/jobs/10043700493)|
   |test-conda-python-3.11-pandas-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.11-pandas-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/5510064334/jobs/10043704783)|
   |test-conda-python-3.8|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.8)](https://github.com/ursacomputing/crossbow/actions/runs/5510061813/jobs/10043699720)|
   |test-conda-python-3.8-pandas-1.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.8-pandas-1.0)](https://github.com/ursacomputing/crossbow/actions/runs/5510055358/jobs/10043688390)|
   |test-conda-python-3.8-spark-v3.1.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.8-spark-v3.1.2)](https://github.com/ursacomputing/crossbow/actions/runs/5510055826/jobs/10043689438)|
   |test-conda-python-3.9|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.9)](https://github.com/ursacomputing/crossbow/actions/runs/5510064764/jobs/10043705688)|
   |test-conda-python-3.9-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.9-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5510063939/jobs/10043703892)|
   |test-conda-python-3.9-spark-v3.2.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-conda-python-3.9-spark-v3.2.0)](https://github.com/ursacomputing/crossbow/actions/runs/5510054045/jobs/10043685778)|
   |test-cuda-python|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-cuda-python)](https://github.com/ursacomputing/crossbow/actions/runs/5510058962/jobs/10043694454)|
   |test-debian-11-python-3|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-f780c64692-azure-test-debian-11-python-3)](https://github.com/ursacomputing/crossbow/runs/14916440231)|
   |test-fedora-35-python-3|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-f780c64692-azure-test-fedora-35-python-3)](https://github.com/ursacomputing/crossbow/runs/14916465558)|
   |test-ubuntu-20.04-python-3|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-f780c64692-azure-test-ubuntu-20.04-python-3)](https://github.com/ursacomputing/crossbow/runs/14916451219)|
   |test-ubuntu-22.04-python-3|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f780c64692-github-test-ubuntu-22.04-python-3)](https://github.com/ursacomputing/crossbow/actions/runs/5510060323/jobs/10043696731)|




[GitHub] [arrow] pitrou commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1253066684


##########
python/pyarrow/parquet/core.py:
##########
@@ -867,6 +867,10 @@ def _sanitize_table(table, new_schema, flavor):
     it will restore the timezone (Parquet only stores the UTC values without
     timezone), or columns with duration type will be restored from the int64
     Parquet column.
+write_page_index : bool, default False

Review Comment:
   Even if it's not used yet, it would probably be beneficial to write files with the index enabled, for future use.
   Is there a performance issue with enabling it?





[GitHub] [arrow] mapleFU commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1621825392

   These still failed, lol




[GitHub] [arrow] github-actions[bot] commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1621637328

   Revision: 9a1a69f95f032a54ebfa085b0935c802bfa87159
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-e8a97fb00e](https://github.com/ursacomputing/crossbow/branches/all?query=actions-e8a97fb00e)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.10|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.10)](https://github.com/ursacomputing/crossbow/actions/runs/5464113322/jobs/9945702419)|
   |test-conda-python-3.10-hdfs-2.9.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.10-hdfs-2.9.2)](https://github.com/ursacomputing/crossbow/actions/runs/5464113561/jobs/9945703131)|
   |test-conda-python-3.10-hdfs-3.2.1|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.10-hdfs-3.2.1)](https://github.com/ursacomputing/crossbow/actions/runs/5464108145/jobs/9945690650)|
   |test-conda-python-3.10-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.10-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5464109560/jobs/9945693766)|
   |test-conda-python-3.10-pandas-nightly|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.10-pandas-nightly)](https://github.com/ursacomputing/crossbow/actions/runs/5464110376/jobs/9945695680)|
   |test-conda-python-3.10-spark-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.10-spark-master)](https://github.com/ursacomputing/crossbow/actions/runs/5464114099/jobs/9945704427)|
   |test-conda-python-3.10-substrait|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.10-substrait)](https://github.com/ursacomputing/crossbow/actions/runs/5464111323/jobs/9945697652)|
   |test-conda-python-3.11|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.11)](https://github.com/ursacomputing/crossbow/actions/runs/5464112528/jobs/9945700389)|
   |test-conda-python-3.11-dask-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.11-dask-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5464109702/jobs/9945694210)|
   |test-conda-python-3.11-dask-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.11-dask-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/5464109205/jobs/9945693186)|
   |test-conda-python-3.11-hypothesis|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.11-hypothesis)](https://github.com/ursacomputing/crossbow/actions/runs/5464112801/jobs/9945701093)|
   |test-conda-python-3.11-pandas-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.11-pandas-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/5464108448/jobs/9945691381)|
   |test-conda-python-3.8|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.8)](https://github.com/ursacomputing/crossbow/actions/runs/5464108692/jobs/9945691928)|
   |test-conda-python-3.8-pandas-1.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.8-pandas-1.0)](https://github.com/ursacomputing/crossbow/actions/runs/5464112251/jobs/9945699724)|
   |test-conda-python-3.8-spark-v3.1.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.8-spark-v3.1.2)](https://github.com/ursacomputing/crossbow/actions/runs/5464113088/jobs/9945702057)|
   |test-conda-python-3.9|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.9)](https://github.com/ursacomputing/crossbow/actions/runs/5464107852/jobs/9945690247)|
   |test-conda-python-3.9-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.9-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5464111900/jobs/9945698900)|
   |test-conda-python-3.9-spark-v3.2.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-conda-python-3.9-spark-v3.2.0)](https://github.com/ursacomputing/crossbow/actions/runs/5464111672/jobs/9945698457)|
   |test-cuda-python|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e8a97fb00e-github-test-cuda-python)](https://github.com/ursacomputing/crossbow/actions/runs/5464110109/jobs/9945695033)|
   |test-debian-11-python-3|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-e8a97fb00e-azure-test-debian-11-python-3)](https://github.com/ursacomputing/crossbow/runs/14792814225)|
   |test-fedora-35-python-3|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-e8a97fb00e-azure-test-fedora-35-python-3)](https://github.com/ursacomputing/crossbow/runs/14792804621)|
   |test-ubuntu-20.04-python-3|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-e8a97fb00e-azure-test-ubuntu-20.04-python-3)](https://github.com/ursacomputing/crossbow/runs/14792803919)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1253067102


##########
python/pyarrow/_parquet.pyx:
##########
@@ -493,6 +493,16 @@ cdef class ColumnChunkMetaData(_Weakrefable):
         """Uncompressed size in bytes (int)."""
         return self.metadata.total_uncompressed_size()
 
+    @property
+    def has_offset_index(self):
+        """Has offset index"""

Review Comment:
   Ahah, I forgot about that.





[GitHub] [arrow] github-actions[bot] commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1621366160

   Revision: ba4b65f685bce7c08e0e1ff9abe13524dd29c1e8
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-d0350e7bd5](https://github.com/ursacomputing/crossbow/branches/all?query=actions-d0350e7bd5)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.10|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.10)](https://github.com/ursacomputing/crossbow/actions/runs/5462591382/jobs/9942156310)|
   |test-conda-python-3.10-hdfs-2.9.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.10-hdfs-2.9.2)](https://github.com/ursacomputing/crossbow/actions/runs/5462598128/jobs/9942170520)|
   |test-conda-python-3.10-hdfs-3.2.1|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.10-hdfs-3.2.1)](https://github.com/ursacomputing/crossbow/actions/runs/5462591257/jobs/9942156155)|
   |test-conda-python-3.10-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.10-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5462592939/jobs/9942159416)|
   |test-conda-python-3.10-pandas-nightly|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.10-pandas-nightly)](https://github.com/ursacomputing/crossbow/actions/runs/5462593421/jobs/9942160689)|
   |test-conda-python-3.10-spark-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.10-spark-master)](https://github.com/ursacomputing/crossbow/actions/runs/5462597645/jobs/9942169439)|
   |test-conda-python-3.10-substrait|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.10-substrait)](https://github.com/ursacomputing/crossbow/actions/runs/5462591644/jobs/9942157183)|
   |test-conda-python-3.11|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.11)](https://github.com/ursacomputing/crossbow/actions/runs/5462597188/jobs/9942168431)|
   |test-conda-python-3.11-dask-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.11-dask-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5462594039/jobs/9942162091)|
   |test-conda-python-3.11-dask-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.11-dask-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/5462596586/jobs/9942167589)|
   |test-conda-python-3.11-hypothesis|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.11-hypothesis)](https://github.com/ursacomputing/crossbow/actions/runs/5462593727/jobs/9942161300)|
   |test-conda-python-3.11-pandas-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.11-pandas-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/5462595244/jobs/9942164456)|
   |test-conda-python-3.8|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.8)](https://github.com/ursacomputing/crossbow/actions/runs/5462593152/jobs/9942159947)|
   |test-conda-python-3.8-pandas-1.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.8-pandas-1.0)](https://github.com/ursacomputing/crossbow/actions/runs/5462592653/jobs/9942158786)|
   |test-conda-python-3.8-spark-v3.1.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.8-spark-v3.1.2)](https://github.com/ursacomputing/crossbow/actions/runs/5462594256/jobs/9942162456)|
   |test-conda-python-3.9|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.9)](https://github.com/ursacomputing/crossbow/actions/runs/5462596371/jobs/9942166747)|
   |test-conda-python-3.9-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.9-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/5462591881/jobs/9942157571)|
   |test-conda-python-3.9-spark-v3.2.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-conda-python-3.9-spark-v3.2.0)](https://github.com/ursacomputing/crossbow/actions/runs/5462596944/jobs/9942168048)|
   |test-cuda-python|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-d0350e7bd5-github-test-cuda-python)](https://github.com/ursacomputing/crossbow/actions/runs/5462597875/jobs/9942169897)|
   |test-debian-11-python-3|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-d0350e7bd5-azure-test-debian-11-python-3)](https://github.com/ursacomputing/crossbow/runs/14788312311)|
   |test-fedora-35-python-3|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-d0350e7bd5-azure-test-fedora-35-python-3)](https://github.com/ursacomputing/crossbow/runs/14788315640)|
   |test-ubuntu-20.04-python-3|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-d0350e7bd5-azure-test-ubuntu-20.04-python-3)](https://github.com/ursacomputing/crossbow/runs/14788306362)|




[GitHub] [arrow] pitrou merged pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou merged PR #36290:
URL: https://github.com/apache/arrow/pull/36290




[GitHub] [arrow] mapleFU commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1254496266


##########
python/pyarrow/parquet/core.py:
##########
@@ -867,6 +867,10 @@ def _sanitize_table(table, new_schema, flavor):
     it will restore the timezone (Parquet only stores the UTC values without
     timezone), or columns with duration type will be restored from the int64
     Parquet column.
+write_page_index : bool, default False

Review Comment:
   Sure, I'll





[GitHub] [arrow] pitrou commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1254572259


##########
python/pyarrow/_parquet.pyx:
##########
@@ -1599,6 +1610,14 @@ cdef shared_ptr[WriterProperties] _create_writer_properties(
     # a size larger than this then it will be latched to this value.
     props.max_row_group_length(_MAX_ROW_GROUP_SIZE)
 
+    # page index
+
+    if isinstance(write_page_index, bool):

Review Comment:
   You didn't change anything, did you?





[GitHub] [arrow] mapleFU commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1625279103

   @pitrou @jorisvandenbossche I've tried to address the review comments here. Would you mind taking a look?




[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1254470409


##########
python/pyarrow/parquet/core.py:
##########
@@ -867,6 +867,10 @@ def _sanitize_table(table, new_schema, flavor):
     it will restore the timezone (Parquet only stores the UTC values without
     timezone), or columns with duration type will be restored from the int64
     Parquet column.
+write_page_index : bool, default False

Review Comment:
   So if I understand correctly, we do not yet use the PageIndex when reading files (through the Python APIs) to prune pages when given a filter?
   
   Should we mention that in the docstring to note that you can already write a PageIndex, but it will not yet be used when reading using pyarrow?





[GitHub] [arrow] AlenkaF commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1251612281


##########
python/pyarrow/parquet/core.py:
##########
@@ -867,6 +867,10 @@ def _sanitize_table(table, new_schema, flavor):
     it will restore the timezone (Parquet only stores the UTC values without
     timezone), or columns with duration type will be restored from the int64
     Parquet column.
+write_page_index : bool, default False
+    Parquet format supports a page index that allows page index that allows
+    Reader to skipping reading pages of data. This option enables writing
+    the page index to the file.

Review Comment:
   ```suggestion
       Parquet format supports page index that makes filtering when
       reading more efficient. This option enables writing the page
       index to the Parquet file.
   ```





[GitHub] [arrow] pitrou commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1621360533

   @github-actions crossbow submit -g python




[GitHub] [arrow] AlenkaF commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1621590106

   * test-conda-python-3.10-spark-master
   * test-conda-python-3.8-spark-v3.1.2
   * test-conda-python-3.9-spark-v3.2.0
   
   Spark failures are known and have an issue opened.
   
   * test-conda-python-3.11-hypothesis
   
   Hypothesis failure is a new one but I do not see how it could be related to this PR.
   
   * test-cuda-python
   
   I have seen nightlies fail with this error today already, so this is not related to the PR either. 




[GitHub] [arrow] pitrou commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1621589721

   @mapleFU Those are unrelated to this PR. Can you try to rebase?




[GitHub] [arrow] mapleFU commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1621527696

   * test-conda-python-3.10-spark-master
   * test-cuda-python
   * test-conda-python-3.8-spark-v3.1.2
   * test-conda-python-3.10-spark-master
   
   These cases failed. How can I fix them?




[GitHub] [arrow] pitrou commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1253022071


##########
python/pyarrow/_parquet.pyx:
##########
@@ -493,6 +493,16 @@ cdef class ColumnChunkMetaData(_Weakrefable):
         """Uncompressed size in bytes (int)."""
         return self.metadata.total_uncompressed_size()
 
+    @property
+    def has_offset_index(self):
+        """Has offset index"""

Review Comment:
   Also it would be nice to implement `has_index_page` and `index_page_offset`...





[GitHub] [arrow] mapleFU commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1621358063

   Can this patch be merged? Or should I wait for review from other committers?




[GitHub] [arrow] mapleFU commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1253058005


##########
python/pyarrow/_parquet.pyx:
##########
@@ -493,6 +493,16 @@ cdef class ColumnChunkMetaData(_Weakrefable):
         """Uncompressed size in bytes (int)."""
         return self.metadata.total_uncompressed_size()
 
+    @property
+    def has_offset_index(self):
+        """Has offset index"""

Review Comment:
   In short, `IndexPage` is not `PageIndex`.
   
   1. https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L536 Currently, the index page turns out to be something user-defined.
   2. PageIndex is a zonemap for pages.
   
   So they are different things.
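
To illustrate what "a zonemap for pages" means, here is a minimal sketch in plain Python. It is not the Parquet or Arrow implementation. The `PageStats` class and the `pages_to_read` function are hypothetical names introduced only for this example, and the min/max values are made up.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Min/max summary for one data page (hypothetical structure)."""
    first_row: int
    num_rows: int
    min_value: int
    max_value: int


def pages_to_read(index, lower, upper):
    """Return positions of pages whose [min, max] range overlaps [lower, upper]."""
    return [
        i for i, page in enumerate(index)
        if page.max_value >= lower and page.min_value <= upper
    ]


# Illustrative index for a column chunk with three pages.
index = [
    PageStats(first_row=0, num_rows=100, min_value=1, max_value=40),
    PageStats(first_row=100, num_rows=100, min_value=41, max_value=90),
    PageStats(first_row=200, num_rows=100, min_value=91, max_value=150),
]

# A filter such as 50 <= value <= 60 only needs the second page.
selected = pages_to_read(index, lower=50, upper=60)
```

A reader that consults such a summary can skip pages whose value range cannot match the filter, which is the optimization the page index is meant to enable.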





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1253008522


##########
python/pyarrow/parquet/core.py:
##########
@@ -867,6 +867,10 @@ def _sanitize_table(table, new_schema, flavor):
     it will restore the timezone (Parquet only stores the UTC values without
     timezone), or columns with duration type will be restored from the int64
     Parquet column.
+write_page_index : bool, default False

Review Comment:
   Side question: should we consider making this turned on by default at some point?





[GitHub] [arrow] mapleFU commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1254499184


##########
python/pyarrow/parquet/core.py:
##########
@@ -867,6 +867,10 @@ def _sanitize_table(table, new_schema, flavor):
     it will restore the timezone (Parquet only stores the UTC values without
     timezone), or columns with duration type will be restored from the int64
     Parquet column.
+write_page_index : bool, default False

Review Comment:
   @jorisvandenbossche I've done that. By the way, pyarrow cannot yet use it for filtering, but `parquet-rs` and `parquet-mr` can use it to optimize reads.



##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -357,6 +357,20 @@ def test_field_id_metadata():
     assert schema[5].metadata[field_id] == b'-1000'
 
 
+def test_parquet_file_page_index():
+    table = pa.table({'a': [1, 2, 3]})
+
+    writer = pa.BufferOutputStream()
+    _write_table(table, writer, write_page_index=True)
+    reader = pa.BufferReader(writer.getvalue())
+
+    # Can retrieve sorting columns from metadata
+    metadata = pq.read_metadata(reader)
+    cc = metadata.row_group(0).column(0)
+    assert cc.has_offset_index is True
+    assert cc.has_column_index is True

Review Comment:
   done





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1254474757


##########
python/pyarrow/_parquet.pyx:
##########
@@ -1599,6 +1610,14 @@ cdef shared_ptr[WriterProperties] _create_writer_properties(
     # a size larger than this then it will be latched to this value.
     props.max_row_group_length(_MAX_ROW_GROUP_SIZE)
 
+    # page index
+
+    if isinstance(write_page_index, bool):

Review Comment:
   I think this pattern comes from the code above, where such an `isinstance` check is also used. But that's for cases where the keyword could either be True/False or a dictionary to enable the option per column. 
   Here it's just a single True/False, so indeed the check is not needed. 
   
   I think we can simply leave out the `if isinstance` and just have the `if write_page_index: ... else: ...`
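
The distinction drawn in this comment can be sketched in plain Python. This is only an illustration under assumptions: `build_properties`, `per_column_option`, and the returned dictionary are hypothetical stand-ins, not the actual `_create_writer_properties` code or the `WriterProperties` builder.

```python
def build_properties(per_column_option, write_page_index):
    """Collect writer settings into a plain dict (hypothetical helper)."""
    props = {"columns": {}}

    # Option that accepts either a bool or a per-column mapping:
    # the isinstance check is needed to tell the two forms apart.
    if isinstance(per_column_option, bool):
        props["default"] = per_column_option
    else:
        props["default"] = False
        for name, enabled in per_column_option.items():
            props["columns"][name] = bool(enabled)

    # Option that is only ever a single True/False:
    # a plain truthiness test is enough, no isinstance check.
    if write_page_index:
        props["page_index"] = True
    else:
        props["page_index"] = False

    return props


global_setting = build_properties(True, write_page_index=False)
per_column = build_properties({"a": True}, write_page_index=True)
```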





[GitHub] [arrow] github-actions[bot] commented on pull request #36290: GH-36284: [Python][Parquet] Support write page index in Python API

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36290:
URL: https://github.com/apache/arrow/pull/36290#issuecomment-1606050717

   :warning: GitHub issue #36284 **has been automatically assigned in GitHub** to PR creator.

