You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "sjperkins (via GitHub)" <gi...@apache.org> on 2023/03/07 11:01:28 UTC

[GitHub] [arrow] sjperkins opened a new pull request, #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

sjperkins opened a new pull request, #34483:
URL: https://github.com/apache/arrow/pull/34483

   Alternative implementation to https://github.com/apache/arrow/pull/34469
   
   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   Explained in greater detail in https://github.com/apache/arrow/issues/33997, but my summary follows:
   
   * C++ Extension Types currently are not well-exposed in the Python Layer
   * In particular, generic `ExtensionArray` and `ExtensionScalar` types are associated with C++ Extensions exposed through `BaseExtensionType`, rather than hand-coded Python types (which would be laborious to generate).
   
   ### What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   * When `BaseExtensionType.__arrow_ext_class__()` and `BaseExtensionType.__arrow_ext_scalar_class__()` are called, associated Python classes are created on the fly and cached for future calls.
   * `BaseExtensionType.__arrow_ext_serialize__()` and `BaseExtensionType.__arrow_ext_deserialize__()` defer to `CExtensionType.Serialize()` and `CExtensionType.Deserialize()` respectively.
   
   ### Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   * `__arrow_ext_class__()` and `__arrow_ext_scalar_class__()` are tested
   * `__arrow_ext_serialize__()` and `__arrow_ext_deserialize__()` are not tested yet. I'd appreciate feedback on whether the general approach is useful going forward before adding tests.
   
   ### Are there any user-facing changes?
   
   * I don't believe so.
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->
   * Closes: #33997 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] sjperkins commented on pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "sjperkins (via GitHub)" <gi...@apache.org>.
sjperkins commented on PR #34483:
URL: https://github.com/apache/arrow/pull/34483#issuecomment-1491459870

   > Yes, the UI didn't seem to give me the option to reopen the PR.
   
   But I do have the ability to close the PR..
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] sjperkins commented on pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "sjperkins (via GitHub)" <gi...@apache.org>.
sjperkins commented on PR #34483:
URL: https://github.com/apache/arrow/pull/34483#issuecomment-1491477330

   @jorisvandenbossche Regarding this issue and #33997, they're somewhat exploratory because they dynamically generate Python types.
   
   I'm interested in whether these strategies would be a useful way of exposing pure C++ Extension Types in Python.
   
   For instance, `{from,to}_numpy()` and `{from,to}_pandas` could then be added to these types.
   
   However, perhaps this should be done through the PyExtensionType mechanisms (both Python and C++)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] sjperkins closed pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "sjperkins (via GitHub)" <gi...@apache.org>.
sjperkins closed pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types
URL: https://github.com/apache/arrow/pull/34483


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] sjperkins closed pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "sjperkins (via GitHub)" <gi...@apache.org>.
sjperkins closed pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types
URL: https://github.com/apache/arrow/pull/34483


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34483:
URL: https://github.com/apache/arrow/pull/34483#issuecomment-1491681707

   OK, that's good to know, thanks for checking!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amol- commented on pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "amol- (via GitHub)" <gi...@apache.org>.
amol- commented on PR #34483:
URL: https://github.com/apache/arrow/pull/34483#issuecomment-1490666850

   Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] sjperkins commented on pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "sjperkins (via GitHub)" <gi...@apache.org>.
sjperkins commented on PR #34483:
URL: https://github.com/apache/arrow/pull/34483#issuecomment-1491537307

   > For instance, {from,to}_numpy() and {from,to}_pandas could then be added to these types.
   
   Some further context to motivate for this: It would be useful to efficiently convert nested FixedSizeListArray's into numpy arrays. The default produces deeply nested numpy arrays with `object` dtype. See https://github.com/ratt-ru/casa-arrow/issues/2 for example.
   
   This may also be relevant to #8510 which, from imperfect memory, uses FixedSizeListArray's to represent Tensors.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34483:
URL: https://github.com/apache/arrow/pull/34483#issuecomment-1491702391

   > I'm interested in whether these strategies would be a useful way of exposing pure C++ Extension Types in Python.
   
   I am not a huge fan of the idea of creating classes on the fly .. Also, does this give something more useful than wrapping it just in a base class as is done now? (because right now this generated class also doesn't have any extension-type specific logic?)
   
   > Some further context to motivate for this: It would be useful to efficiently convert nested FixedSizeListArray's into numpy arrays. 
   
   Yeah, so if we have a way to register a python class to use as the type class for an extension type implemented in C++ (https://github.com/apache/arrow/issues/33997), you can override the `to_numpy()` method. However, than only works if you call this method directly on the array object. But if you convert a table, or a chunked array, it will still go through the C++ layer which currently falls back on converting the storage array. So we might need to think about a more general mechanism here to tap into this conversion logic.
   
   
   > This may also be relevant to #8510 which, from imperfect memory, uses FixedSizeListArray's to represent Tensors.
   
   Yes, that one is close to being merged, and then we can expose this in python, which might be a useful exercise to see how this goes / what we can learn from this (but of course it's an internal one, so we can hard code support for it)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] sjperkins commented on pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "sjperkins (via GitHub)" <gi...@apache.org>.
sjperkins commented on PR #34483:
URL: https://github.com/apache/arrow/pull/34483#issuecomment-1494231406

   Closing because the use of dynamic types is a bit too exotic.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche closed pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche closed pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types
URL: https://github.com/apache/arrow/pull/34483


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amol- closed pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "amol- (via GitHub)" <gi...@apache.org>.
amol- closed pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types
URL: https://github.com/apache/arrow/pull/34483


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amol- commented on pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "amol- (via GitHub)" <gi...@apache.org>.
amol- commented on PR #34483:
URL: https://github.com/apache/arrow/pull/34483#issuecomment-1490736334

   Yes of course, the cleaning was done exactly to surface the prs that still needed attention


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34483:
URL: https://github.com/apache/arrow/pull/34483#issuecomment-1491429386

   @sjperkins sorry about closing all your PRs, and about the slow review here. I will try to get back to this one and https://github.com/apache/arrow/pull/34469 shortly.
   
   Other question: you couldn't reopen this PR yourself? I mean, did the GitHub UI allow that? Because if only committers can reopen PRs, that would be useful to know because the messaging should be different then if we would do some auto-closing in the future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] sjperkins commented on pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "sjperkins (via GitHub)" <gi...@apache.org>.
sjperkins commented on PR #34483:
URL: https://github.com/apache/arrow/pull/34483#issuecomment-1491797111

   > I am not a huge fan of the idea of creating classes on the fly .. Also, does this give something more useful than wrapping it just in a base class as is done now? (because right now this generated class also doesn't have any extension-type specific logic?)
   
   Yes, I understand the suggestion is a bit exotic. I think what it gives is the ability to associate specific Python classes with C++ Extension Types/Arrays  and attach (mainly) `{from,to}_numpy` functionality to the Python class. There probably are nicer ways to do it, but nothing comes immediately to mind. Perhaps there's a way to do this with the existing C++ PyExtensionType machinery.
   
   > Yeah, so if we have a way to register a python class to use as the type class for an extension type implemented in C++ (#33997), you can override the `to_numpy()` method. However, than only works if you call this method directly on the array object. But if you convert a table, or a chunked array, it will still go through the C++ layer which currently falls back on converting the storage array. So we might need to think about a more general mechanism here to tap into this conversion logic.
   
   Thanks for pointing this out, this gives me a broader perspective on what would be useful.
   
   > 
   > > This may also be relevant to #8510 which, from imperfect memory, uses FixedSizeListArray's to represent Tensors.
   > 
   > Yes, that one is close to being merged, and then we can expose this in python, which might be a useful exercise to see how this goes / what we can learn from this (but of course it's an internal one, so we can hard code support for it)
   
   I'll think about this a bit more. let me know if there're approaches you think would be useful to experiment with.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] sjperkins commented on pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "sjperkins (via GitHub)" <gi...@apache.org>.
sjperkins commented on PR #34483:
URL: https://github.com/apache/arrow/pull/34483#issuecomment-1491468130

   > Yes, the UI didn't seem to give me the option to reopen the PR.
   
   OK, this does seem possible if I close the PR. Apologies I responded late yesterday evening, so I may have conflated the force push in #33805 or missed the button in this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34483:
URL: https://github.com/apache/arrow/pull/34483#issuecomment-1457970646

   * Closes: #33801


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] sjperkins commented on pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "sjperkins (via GitHub)" <gi...@apache.org>.
sjperkins commented on PR #34483:
URL: https://github.com/apache/arrow/pull/34483#issuecomment-1491670384

   > Can you reopen it now? (it might depend on whether you closed it yourself)
   
   Ah I cannot. I think it does depend on the closer.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] sjperkins commented on pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "sjperkins (via GitHub)" <gi...@apache.org>.
sjperkins commented on PR #34483:
URL: https://github.com/apache/arrow/pull/34483#issuecomment-1490693885

   Would it be possible to re-open this issue? I'm awaiting review from Arrow maintainers here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34483:
URL: https://github.com/apache/arrow/pull/34483#issuecomment-1491646664

   Can you reopen it now? (it might depend on whether you closed it yourself)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] sjperkins commented on pull request #34483: GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types

Posted by "sjperkins (via GitHub)" <gi...@apache.org>.
sjperkins commented on PR #34483:
URL: https://github.com/apache/arrow/pull/34483#issuecomment-1491459193

   > @sjperkins sorry about closing all your PRs, and about the slow review here. I will try to get back to this one and #34469 shortly.
   
   No problem! I assumed that the upcoming Arrow 12.0.0 release was important.
   
   > Other question: you couldn't reopen this PR yourself? I mean, did the GitHub UI allow that? Because if only committers can reopen PRs, that would be useful to know because the messaging should be different then if we would do some auto-closing in the future.
   
   Yes, the UI didn't seem to give me the option to reopen the PR.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org