Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/08 23:32:47 UTC

[GitHub] [beam] robertwb opened a new pull request, #21762: Better cross-language support for dataframe reads.

robertwb opened a new pull request, #21762:
URL: https://github.com/apache/beam/pull/21762

   Adds a new Read/WriteViaPandas transform and better
   support for object types that are actually strings.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] robertwb commented on pull request #21762: Better cross-language support for dataframe reads.

Posted by GitBox <gi...@apache.org>.
robertwb commented on PR #21762:
URL: https://github.com/apache/beam/pull/21762#issuecomment-1151372930

   OK, I've made this local to the transform. I agree that this is cleaner, thanks. 




[GitHub] [beam] TheNeuralBit commented on pull request #21762: Better cross-language support for dataframe reads.

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on PR #21762:
URL: https://github.com/apache/beam/pull/21762#issuecomment-1150534792

   > and convert_dtypes requires looking at the entire PCollection to figure out the proxy object
   
   Right, I know we can't do `convert_dtypes` exactly. I just meant we could add something similar to it that assumes all object columns are coercible to strings and raises an error at execution time if they're not.
   
   I think doing that conversion explicitly with the DataFrame API would be preferable to plumbing the typehint through the schema code. But maybe that uglies things up to have to do `read_csv().coerce_objects_to_strings()`?
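
To make the proposal concrete, here is a minimal sketch in plain pandas of what such an operation might do. The name `coerce_objects_to_strings` comes from the comment above, but this standalone function, its signature, and the choice of `TypeError` are assumptions for illustration; it is not Beam's deferred DataFrame API and not necessarily what the PR ended up doing.

```python
import pandas as pd


def coerce_objects_to_strings(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: cast every object column to pandas' string dtype.

    Raises TypeError when an object column holds a value that is neither a
    str nor a missing value, so the failure surfaces when the data is seen.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if not pd.api.types.is_object_dtype(col.dtype):
            continue  # leave numeric, datetime, etc. columns untouched
        for value in col:
            is_missing = pd.api.types.is_scalar(value) and pd.isna(value)
            if not (isinstance(value, str) or is_missing):
                raise TypeError(
                    f"column {name!r} holds non-string value {value!r}")
        out[name] = col.astype("string")
    return out


frame = pd.DataFrame({
    "name": pd.Series(["alice", None], dtype=object),
    "count": [1, 2],
})
coerced = coerce_objects_to_strings(frame)
print(coerced.dtypes)
```

The output dtypes depend only on the input dtypes, so the result schema is known up front, while the value check can only fail once real data arrives.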




[GitHub] [beam] TheNeuralBit commented on pull request #21762: Better cross-language support for dataframe reads.

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on PR #21762:
URL: https://github.com/apache/beam/pull/21762#issuecomment-1150521965

   What happens for object columns that aren't actually strings though? I wonder if a better approach would be to make sure that we get `StringDtype` columns in cross-language contexts.
   
   Unfortunately I don't see a way to configure read_csv to get StringDtype out. Maybe we could add something akin to [`convert_dtypes`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html) that will attempt to coerce all object columns to strings (and raise an execution-time error if that's impossible).




[GitHub] [beam] robertwb merged pull request #21762: Better cross-language support for dataframe reads.

Posted by GitBox <gi...@apache.org>.
robertwb merged PR #21762:
URL: https://github.com/apache/beam/pull/21762




[GitHub] [beam] robertwb commented on pull request #21762: Better cross-language support for dataframe reads.

Posted by GitBox <gi...@apache.org>.
robertwb commented on PR #21762:
URL: https://github.com/apache/beam/pull/21762#issuecomment-1150514971

   R: @TheNeuralBit 




[GitHub] [beam] codecov[bot] commented on pull request #21762: Better cross-language support for dataframe reads.

Posted by GitBox <gi...@apache.org>.
codecov[bot] commented on PR #21762:
URL: https://github.com/apache/beam/pull/21762#issuecomment-1150524673

   # [Codecov](https://codecov.io/gh/apache/beam/pull/21762?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#21762](https://codecov.io/gh/apache/beam/pull/21762?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (108f576) into [master](https://codecov.io/gh/apache/beam/commit/edddbaa5f27c1492b108afe1baecf3fd08be9554?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (edddbaa) will **increase** coverage by `0.00%`.
   > The diff coverage is `60.71%`.
   
   ```diff
   @@           Coverage Diff           @@
   ##           master   #21762   +/-   ##
   =======================================
     Coverage   74.02%   74.02%           
   =======================================
     Files         698      698           
     Lines       92134    92212   +78     
   =======================================
   + Hits        68203    68264   +61     
   - Misses      22680    22697   +17     
     Partials     1251     1251           
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | python | `83.60% <60.71%> (-0.01%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/beam/pull/21762?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [sdks/python/apache\_beam/io/iobase.py](https://codecov.io/gh/apache/beam/pull/21762/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vaW8vaW9iYXNlLnB5) | `86.25% <ø> (ø)` | |
   | [sdks/python/apache\_beam/dataframe/io.py](https://codecov.io/gh/apache/beam/pull/21762/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vZGF0YWZyYW1lL2lvLnB5) | `89.90% <41.17%> (-2.13%)` | :arrow_down: |
   | [sdks/python/apache\_beam/dataframe/schemas.py](https://codecov.io/gh/apache/beam/pull/21762/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vZGF0YWZyYW1lL3NjaGVtYXMucHk=) | `96.92% <90.00%> (-0.72%)` | :arrow_down: |
   | [sdks/python/apache\_beam/dataframe/convert.py](https://codecov.io/gh/apache/beam/pull/21762/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vZGF0YWZyYW1lL2NvbnZlcnQucHk=) | `90.36% <100.00%> (ø)` | |
   | [...hon/apache\_beam/runners/worker/bundle\_processor.py](https://codecov.io/gh/apache/beam/pull/21762/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy93b3JrZXIvYnVuZGxlX3Byb2Nlc3Nvci5weQ==) | `93.42% <0.00%> (-0.25%)` | :arrow_down: |
   | [...thon/apache\_beam/ml/inference/sklearn\_inference.py](https://codecov.io/gh/apache/beam/pull/21762/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vbWwvaW5mZXJlbmNlL3NrbGVhcm5faW5mZXJlbmNlLnB5) | `92.50% <0.00%> (-0.19%)` | :arrow_down: |
   | [...ks/python/apache\_beam/runners/worker/sdk\_worker.py](https://codecov.io/gh/apache/beam/pull/21762/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy93b3JrZXIvc2RrX3dvcmtlci5weQ==) | `88.94% <0.00%> (-0.16%)` | :arrow_down: |
   | [...thon/apache\_beam/ml/inference/pytorch\_inference.py](https://codecov.io/gh/apache/beam/pull/21762/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vbWwvaW5mZXJlbmNlL3B5dG9yY2hfaW5mZXJlbmNlLnB5) | `0.00% <0.00%> (ø)` | |
   | [sdks/python/apache\_beam/ml/inference/base.py](https://codecov.io/gh/apache/beam/pull/21762/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vbWwvaW5mZXJlbmNlL2Jhc2UucHk=) | `93.70% <0.00%> (+0.04%)` | :arrow_up: |
   | [sdks/python/apache\_beam/runners/direct/executor.py](https://codecov.io/gh/apache/beam/pull/21762/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy9kaXJlY3QvZXhlY3V0b3IucHk=) | `97.01% <0.00%> (+0.54%)` | :arrow_up: |
   | ... and [5 more](https://codecov.io/gh/apache/beam/pull/21762/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/beam/pull/21762?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/beam/pull/21762?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [edddbaa...108f576](https://codecov.io/gh/apache/beam/pull/21762?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   




[GitHub] [beam] robertwb commented on pull request #21762: Better cross-language support for dataframe reads.

Posted by GitBox <gi...@apache.org>.
robertwb commented on PR #21762:
URL: https://github.com/apache/beam/pull/21762#issuecomment-1274256193

   I was thinking this could also be useful for other cases where the dtype(s)
   could not be inferred, though I can see how that would be sketchy.
   Would you prefer we add a new coerce_objects_to_strings operation
   to deferred dataframes? (Or I suppose I could do it manually using
   DataFrame.astype, iterating over the columns.) Let me see what that looks
   like.
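
As a rough illustration of the manual `DataFrame.astype` route mentioned above, the sketch below builds a column-to-dtype mapping from the dtypes alone and applies it in one call. The helper name `object_columns_as_strings` is hypothetical, and this is plain pandas rather than a Beam deferred dataframe. It adds no validation of its own; it only shows that the resulting schema can be computed from an empty proxy without looking at any rows.

```python
import pandas as pd


def object_columns_as_strings(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: choose target dtypes by inspecting df.dtypes only,
    # so no values are examined when deciding the output schema.
    targets = {
        name: "string"
        for name, dtype in df.dtypes.items()
        if pd.api.types.is_object_dtype(dtype)
    }
    return df.astype(targets)


data = pd.DataFrame({
    "name": pd.Series(["alice", "bob"], dtype=object),
    "count": [1, 2],
})

# An empty slice keeps the column dtypes, standing in for a proxy object.
proxy_result = object_columns_as_strings(data.iloc[:0])
full_result = object_columns_as_strings(data)

print(proxy_result.dtypes)
print(full_result.dtypes)
```

Because the empty proxy and the full frame end up with the same dtypes, this kind of conversion does not need the data-dependent inference that `convert_dtypes` performs.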
   




[GitHub] [beam] robertwb commented on pull request #21762: Better cross-language support for dataframe reads.

Posted by GitBox <gi...@apache.org>.
robertwb commented on PR #21762:
URL: https://github.com/apache/beam/pull/21762#issuecomment-1150530866

   If the columns aren't actually strings, then an exception will be raised when trying to encode them as strings. I wasn't able to figure out how to get StringDtype out of csv either, and convert_dtypes requires looking at the entire PCollection to figure out the proxy object. I think this is pretty safe because objects should be strings immediately following a csv (or json, or most other common types) read. If they're not, exposing this in a cross-language way would be hard anyway. (I suppose one may encounter the bytes type and nested types, though IIRC pandas doesn't do too well with those anyway.)
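
A small plain-pandas illustration of why `convert_dtypes` is awkward here: the dtype it picks for an object column depends on the values it sees. The frames and variable names below are made up for the example, and nothing in it involves Beam.

```python
import pandas as pd

# Two frames with the same declared dtype (object) but different contents.
all_strings = pd.DataFrame({"col": pd.Series(["a", "b"], dtype=object)})
mixed = pd.DataFrame({"col": pd.Series(["a", 1], dtype=object)})

# convert_dtypes inspects the values to choose a result dtype.
strings_dtype = all_strings.convert_dtypes()["col"].dtype
mixed_dtype = mixed.convert_dtypes()["col"].dtype

print("all strings ->", strings_dtype)
print("mixed       ->", mixed_dtype)
```

Since the two results differ even though both inputs start as `object`, the output schema cannot be decided from the input dtypes alone, which is the difficulty with computing a proxy ahead of execution.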

