You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "Fokko (via GitHub)" <gi...@apache.org> on 2023/02/01 14:11:27 UTC

[GitHub] [arrow] Fokko opened a new pull request, #33974: GH-33973: [PYTHON] Update parquet filter javadoc

Fokko opened a new pull request, #33974:
URL: https://github.com/apache/arrow/pull/33974

   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   I wrote a converter from an arbitrary expression to DNF, but this was not needed after learning that it just accepts an expression now.
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   ### What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   ### Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   ### Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #33974: GH-33973: [Python][Docs] Update documentation for Parquet filter keyword

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #33974:
URL: https://github.com/apache/arrow/pull/33974#issuecomment-1431394875

   ['Python', 'R'] benchmarks have high level of regressions.
   [test-mac-arm](https://conbench.ursa.dev/compare/runs/6192492c830e4751bc915622a64b33c1...c216c5e89fc64cb78d9abc8bfdefeb79/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Fokko commented on pull request #33974: GH-33973: [Python][Docs] Update documentation for Parquet filter

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #33974:
URL: https://github.com/apache/arrow/pull/33974#issuecomment-1413867923

   @westonpace 
   
   > The docs you probably want are [pyarrow.dataset.dataset](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset) and [pyarrow.dataset.Dataset](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset)
   
   I would agree with you, but I went for a lower API because I want to re-use the connection. More background is provided [here](https://github.com/apache/arrow/issues/33972), and the dataset only accepts paths. If you think it is worthwhile to accept `NativeFile` there as well, let me know, and happy to raise a PR.
   
   > which unfortunately redirects its documentation to [pyarrow.dataset.Scanner.from_dataset](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_dataset) 😰 )
   
   Do you want me to create a PR to copy those docs? Because of the redirect, the arguments are also not showing up in PyCharm. Let me know and I'll create a PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Fokko commented on pull request #33974: GH-33973: [Python][Docs] Update documentation for Parquet filter

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #33974:
URL: https://github.com/apache/arrow/pull/33974#issuecomment-1423899445

   @jorisvandenbossche Good call. I've updated the docstring and added an example.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Fokko commented on pull request #33974: GH-33973: [Python][Docs] Update documentation for Parquet filter

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #33974:
URL: https://github.com/apache/arrow/pull/33974#issuecomment-1416139478

   @jorisvandenbossche first a bit of cleanup in https://github.com/apache/arrow/pull/34034


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #33974: GH-33973: [Python][Docs] Update documentation for Parquet filter

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #33974:
URL: https://github.com/apache/arrow/pull/33974#issuecomment-1414505820

   > Do you want me to create a PR to copy those docs? Because of the redirect, the arguments are also not showing up in PyCharm. Let me know and I'll create a PR.
   
   Yes, I think that is a good idea but I might CC @jorisvandenbossche or @amol- to weigh in on whether they know some better way to avoid the duplication or have a preference here (I normally focus on the C++ end of things.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #33974: GH-33973: [Python][Docs] Update documentation for Parquet filter keyword

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #33974:
URL: https://github.com/apache/arrow/pull/33974#issuecomment-1431393442

   Benchmark runs are scheduled for baseline = e63215ca1b3c9f5f40311a9ff67f0f286cd142c8 and contender = 306026d8611e4bd9676788f6e5a195cae8dc8610. 306026d8611e4bd9676788f6e5a195cae8dc8610 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/8d445d3f381c486c95c443a8e71be999...4517d216d3604cf2b36b89e7d7625c1e/)
   [Failed :arrow_down:0.58% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/6192492c830e4751bc915622a64b33c1...c216c5e89fc64cb78d9abc8bfdefeb79/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/bd8437fc474548a7b5b65a3f9f590ca5...1acc896a38244a8dab43c1a53a64a37a/)
   [Finished :arrow_down:0.16% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/86cac072b9db45b8bf518f00e7eb6bc2...9ac83056e48c41d38b31896b0f389ccb/)
   Buildkite builds:
   [Finished] [`306026d8` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2375)
   [Failed] [`306026d8` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2405)
   [Finished] [`306026d8` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2373)
   [Finished] [`306026d8` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2397)
   [Finished] [`e63215ca` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2374)
   [Failed] [`e63215ca` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2404)
   [Finished] [`e63215ca` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2372)
   [Finished] [`e63215ca` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2396)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Fokko commented on pull request #33974: GH-33973: [Python][Docs] Update documentation for Parquet filter

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #33974:
URL: https://github.com/apache/arrow/pull/33974#issuecomment-1430467706

   @jorisvandenbossche WDYT?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #33974: GH-33973: [Python][Docs] Update documentation for Parquet filter

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #33974:
URL: https://github.com/apache/arrow/pull/33974#issuecomment-1416097157

   @Fokko thanks for the catch! You are currently updating the docstring of ParquetDataset, but we should do the same update for read_table:
   
   https://github.com/apache/arrow/blob/78a8da42d1fc364e84286afc88a25fa5b819d092/python/pyarrow/parquet/core.py#L2775
   
   (I think it's also `read_table` you are using in PyIceberg, and not `ParquetDataset`?)
   
   
   
   
   > > Do you want me to create a PR to copy those docs? Because of the redirect, the arguments are also not showing up in PyCharm. Let me know and I'll create a PR.
   > 
   > Yes, I think that is a good idea but I might CC @jorisvandenbossche or @amol- to weigh in on whether they know some better way to avoid the duplication or have a preference here (I normally focus on the C++ end of things.)
   
   Yeah, I think we should prefer some duplication if that gives better docstrings. I agree the indirection for the user right now isn't very user friendly. 
   We might be able to share some part of the docstring and inject that in multiple places to avoid duplicating the actual content, if that doesn't make things too complicated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche merged pull request #33974: GH-33973: [Python][Docs] Update documentation for Parquet filter keyword

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche merged PR #33974:
URL: https://github.com/apache/arrow/pull/33974


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #33974: GH-33973: [Python][Docs] Update documentation for Parquet filter

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #33974:
URL: https://github.com/apache/arrow/pull/33974#issuecomment-1421003013

   One more thing about the docstring: for both read_table and ParquetDataset, the docstring also injects the content of `_DNF_filter_doc` (which holds the actual explanation of how the filter is expressed in this list of tuples form). We should probably update that to start with saying that the filter can either be a pyarrow Expression, or this DNF list of tuples form.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #33974: GH-33973: [Python][Docs] Update documentation for Parquet filter

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #33974:
URL: https://github.com/apache/arrow/pull/33974#issuecomment-1413802226

   I don't have a problem with this change but wonder if the root cause might be a little more fundamental.  I wonder if `ParquetDataset` itself should be deprecated.  The docs you probably want are [pyarrow.dataset.dataset](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset) and [pyarrow.dataset.Dataset](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset) (though filter would be provided by [pyarrow.dataset.Dataset.to_table](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.to_table) which unfortunately redirects its documentation to [pyarrow.dataset.Scanner.from_dataset](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_dataset) :cold_sweat: ) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Fokko commented on pull request #33974: GH-33973: [Python][Docs] Update documentation for Parquet filter

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #33974:
URL: https://github.com/apache/arrow/pull/33974#issuecomment-1416108262

   @jorisvandenbossche Updating table one is a good suggestion indeed. Updated that one as well 👍🏻 
   
   > (I think it's also read_table you are using in PyIceberg, and not ParquetDataset?)
   
   Currently, [we use the `read_table` indeed](https://github.com/apache/iceberg/blob/master/python/pyiceberg/io/pyarrow.py#L510-L517). I've also played around with the ParquetDataset, and it looked very similar. However, we don't need the lazy nature of the dataset, so directly loading a table makes more sense in our situation.
   
   > Yeah, I think we should prefer some duplication if that gives better docstrings. I agree the indirection for the user right now isn't very user friendly.
   > We might be able to share some part of the docstring and inject that in multiple places to avoid duplicating the actual content, if that doesn't make things too complicated.
   
   Makes a lot of sense, I think sharing would be best. Let me create a separate PR for that


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org