Posted to github@beam.apache.org by "mosche (via GitHub)" <gi...@apache.org> on 2023/07/18 08:08:19 UTC

[GitHub] [beam] mosche opened a new pull request, #27534: [Java][Schemas] Improve performance of GetterBasedSchemaProvider#fromRowFunction

mosche opened a new pull request, #27534:
URL: https://github.com/apache/beam/pull/27534

   Converting a `Row` to the external (user) object is rather inefficient. The code in `FromRowUsingCreator` that converts Row fields to the external field types required by the constructor / setters is branch-heavy and hard for Java's JIT compiler to optimize.
   
   ### Functional observations:
   
   Maps and Collections / Lists are currently handled inconsistently:
   - Maps are eagerly rewritten / copied, which makes them rather expensive to use. The user object will contain fully operable maps.
   - Collections / Lists are converted using Guava views (e.g. `Lists.transform`). The user object will contain collections that can mostly only be read from, though it is possible to remove items from the underlying collection, which can cause issues.
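
To illustrate the difference between the two behaviours, here is a minimal sketch using only the JDK. `transformedView` is a hypothetical stand-in for a lazily transformed list view such as Guava's `Lists.transform`, and `eagerCopy` stands in for the eager map handling; neither is Beam code.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; these helpers are not part of Beam or Guava.
final class ViewVersusCopy {

  // Lazy view: elements are converted on access, and removals write through
  // to the backing list. add/set are not overridden, so they throw
  // UnsupportedOperationException (the AbstractList default).
  static <F, T> List<T> transformedView(
      List<F> backing, Function<? super F, ? extends T> fn) {
    return new AbstractList<T>() {
      @Override
      public T get(int index) {
        return fn.apply(backing.get(index));
      }

      @Override
      public int size() {
        return backing.size();
      }

      @Override
      public T remove(int index) {
        return fn.apply(backing.remove(index));
      }
    };
  }

  // Eager copy: every element is converted up front into an independent list.
  static <F, T> List<T> eagerCopy(
      List<F> source, Function<? super F, ? extends T> fn) {
    List<T> out = new ArrayList<>(source.size());
    for (F item : source) {
      out.add(fn.apply(item));
    }
    return out;
  }

  public static void main(String[] args) {
    List<Integer> rowValues = new ArrayList<>(Arrays.asList(1, 2, 3));
    List<String> view = transformedView(rowValues, String::valueOf);
    List<String> copy = eagerCopy(rowValues, String::valueOf);

    view.remove(0); // also removes the first element of rowValues

    System.out.println(rowValues); // backing list was modified through the view
    System.out.println(copy);      // the eager copy is unaffected
  }
}
```

The view is cheap to create but leaks mutations back into the row's data, while the copy is safe but pays the conversion cost for every element up front.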
   
   
   ### Changes in this PR:
   - On first use, generate a converter function for every field based on schema and type information. Applying the function then requires no further branching, which also makes the code easier for the JIT compiler to optimize.
   - [TBD] Depending on the types, a conversion might not be required at all. In that case it is skipped using an identity converter, e.g. for a map of primitive types. However, passing references to maps / collections could cause issues in certain situations (though this is already the case today, see above).
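
The idea can be sketched roughly as follows. This is not the code in this PR: `ConverterSketch`, `FieldType`, `Kind`, `compile` and `apply` are hypothetical, heavily simplified names, and the `EPOCH_MILLIS` to `Instant` conversion is just an assumed example of a field that needs converting.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical, heavily simplified model; none of these names are Beam's API.
final class ConverterSketch {
  enum Kind { STRING, EPOCH_MILLIS, LIST }

  static final class FieldType {
    final Kind kind;
    final FieldType element; // element type, only set for LIST

    private FieldType(Kind kind, FieldType element) {
      this.kind = kind;
      this.element = element;
    }

    static FieldType of(Kind kind) {
      return new FieldType(kind, null);
    }

    static FieldType listOf(FieldType element) {
      return new FieldType(Kind.LIST, element);
    }
  }

  // Shared instance so "no conversion needed" can be detected by reference.
  static final Function<Object, Object> IDENTITY = value -> value;

  // All branching on the field type happens here, once per field.
  static Function<Object, Object> converterFor(FieldType type) {
    switch (type.kind) {
      case EPOCH_MILLIS:
        // Assumed example: the row stores a Long, the user type wants an Instant.
        return value -> Instant.ofEpochMilli((Long) value);
      case LIST:
        Function<Object, Object> element = converterFor(type.element);
        if (element == IDENTITY) {
          return IDENTITY; // elements need no conversion: pass the reference through
        }
        return value -> {
          List<?> in = (List<?>) value;
          List<Object> out = new ArrayList<>(in.size());
          for (Object item : in) {
            out.add(element.apply(item));
          }
          return out;
        };
      default:
        return IDENTITY;
    }
  }

  // Built once per schema (e.g. on first use) and then reused for every row.
  static List<Function<Object, Object>> compile(List<FieldType> schema) {
    List<Function<Object, Object>> converters = new ArrayList<>(schema.size());
    for (FieldType type : schema) {
      converters.add(converterFor(type));
    }
    return converters;
  }

  // Hot path: one function call per field, no type inspection. Nulls are not handled.
  static Object[] apply(List<Function<Object, Object>> converters, Object[] rowValues) {
    Object[] out = new Object[rowValues.length];
    for (int i = 0; i < rowValues.length; i++) {
      out[i] = converters.get(i).apply(rowValues[i]);
    }
    return out;
  }

  public static void main(String[] args) {
    List<Function<Object, Object>> converters = compile(Arrays.asList(
        FieldType.of(Kind.STRING),
        FieldType.listOf(FieldType.of(Kind.EPOCH_MILLIS))));
    Object[] out = apply(converters, new Object[] {"id-1", Arrays.asList(0L, 1000L)});
    System.out.println(Arrays.toString(out));
  }
}
```

Note that a list whose elements need no conversion is returned by reference, which is exactly the trade-off flagged as [TBD] above.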
   
   ### Benchmark results
   
   ![Screenshot 2023-07-17 at 17 27 53](https://github.com/apache/beam/assets/1401430/4d7abf25-c32c-436c-b3e5-c9f1fffa14bf)
   
   Historic benchmarks for `GetterBasedSchemaProvider` are available here (master branch only): https://s.apache.org/beam-community-metrics/d/kllfR2vVk/java-jmh-benchmarks?orgId=1
   
   Closes #27533
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] mosche commented on pull request #27534: [Java][Schemas] Improve performance of GetterBasedSchemaProvider#fromRowFunction

Posted by "mosche (via GitHub)" <gi...@apache.org>.
mosche commented on PR #27534:
URL: https://github.com/apache/beam/pull/27534#issuecomment-1645513915

   @bvolpato You're right, there are no dedicated tests for `FromRowUsingCreator`. However, the tests of the subclasses mentioned above provide fairly decent coverage.
   
   However, I just noticed that the protobuf, thrift & avro tests are not triggered as part of the precommit tests, and it looks like there's only a trigger phrase for avro :/ That's certainly a blocker.
   
   The other thing I've noticed during the implementation is that some features are not supported by all schemas, e.g. oneof / union types don't work with Avro.




[GitHub] [beam] github-actions[bot] commented on pull request #27534: [Java][Schemas] Improve performance of GetterBasedSchemaProvider#fromRowFunction

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #27534:
URL: https://github.com/apache/beam/pull/27534#issuecomment-1693261866

   Reminder, please take a look at this pr: @Abacn 




[GitHub] [beam] mosche commented on pull request #27534: [Java][Schemas] Improve performance of GetterBasedSchemaProvider#fromRowFunction

Posted by "mosche (via GitHub)" <gi...@apache.org>.
mosche commented on PR #27534:
URL: https://github.com/apache/beam/pull/27534#issuecomment-1639891894

   Run Java PreCommit




[GitHub] [beam] mosche commented on pull request #27534: [Java][Schemas] Improve performance of GetterBasedSchemaProvider#fromRowFunction

Posted by "mosche (via GitHub)" <gi...@apache.org>.
mosche commented on PR #27534:
URL: https://github.com/apache/beam/pull/27534#issuecomment-1643309809

   Run Java PreCommit




[GitHub] [beam] mosche commented on pull request #27534: [Java][Schemas] Improve performance of GetterBasedSchemaProvider#fromRowFunction

Posted by "mosche (via GitHub)" <gi...@apache.org>.
mosche commented on PR #27534:
URL: https://github.com/apache/beam/pull/27534#issuecomment-1645432153

   Run PostCommit_Java_Avro_Versions




[GitHub] [beam] mosche commented on pull request #27534: [Java][Schemas] Improve performance of GetterBasedSchemaProvider#fromRowFunction

Posted by "mosche (via GitHub)" <gi...@apache.org>.
mosche commented on PR #27534:
URL: https://github.com/apache/beam/pull/27534#issuecomment-1641493230

   Run Java PreCommit




[GitHub] [beam] github-actions[bot] commented on pull request #27534: [Java][Schemas] Improve performance of GetterBasedSchemaProvider#fromRowFunction

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #27534:
URL: https://github.com/apache/beam/pull/27534#issuecomment-1639879144

   Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:
   
   R: @bvolpato for label java.
   
   Available commands:
   - `stop reviewer notifications` - opt out of the automated review tooling
   - `remind me after tests pass` - tag the comment author after tests pass
   - `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
   
   The PR bot will only process comments in the main thread (not review comments).




[GitHub] [beam] github-actions[bot] commented on pull request #27534: [Java][Schemas] Improve performance of GetterBasedSchemaProvider#fromRowFunction

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #27534:
URL: https://github.com/apache/beam/pull/27534#issuecomment-1682178706

   Assigning new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment `assign to next reviewer`:
   
   R: @Abacn for label java.
   
   Available commands:
   - `stop reviewer notifications` - opt out of the automated review tooling
   - `remind me after tests pass` - tag the comment author after tests pass
   - `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)




[GitHub] [beam] github-actions[bot] commented on pull request #27534: [Java][Schemas] Improve performance of GetterBasedSchemaProvider#fromRowFunction

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #27534:
URL: https://github.com/apache/beam/pull/27534#issuecomment-1639831609

   Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers`




[GitHub] [beam] github-actions[bot] commented on pull request #27534: [Java][Schemas] Improve performance of GetterBasedSchemaProvider#fromRowFunction

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #27534:
URL: https://github.com/apache/beam/pull/27534#issuecomment-1675882715

   Reminder, please take a look at this pr: @kennknowles 




[GitHub] [beam] bvolpato commented on pull request #27534: [Java][Schemas] Improve performance of GetterBasedSchemaProvider#fromRowFunction

Posted by "bvolpato (via GitHub)" <gi...@apache.org>.
bvolpato commented on PR #27534:
URL: https://github.com/apache/beam/pull/27534#issuecomment-1643833955

   Wow! That's awesome, great improvements. Nice work.
   
   I don't have any red flags, I'm just slightly concerned about whether we have good test coverage around this change. It seems that we don't have unit tests for either `FromRowUsingCreator` or `GetterBasedSchemaProvider`; instead we rely on the tests of `GetterBasedSchemaProvider`'s subclasses (`AutoValueSchemaTest`, `AvroSchemaTest`, `ThriftSchemaTest`, `JavaFieldSchemaTest`, `JavaBeanSchemaTest`, `ProtoMessageSchemaTest`).
   
   I think it should be fine, but I don't know if we cover all scenarios (I'll try to look at some of those tests).
   
   @chamikaramj Can you please take a look too?
   




[GitHub] [beam] mosche commented on pull request #27534: [Java][Schemas] Improve performance of GetterBasedSchemaProvider#fromRowFunction

Posted by "mosche (via GitHub)" <gi...@apache.org>.
mosche commented on PR #27534:
URL: https://github.com/apache/beam/pull/27534#issuecomment-1639842723

   retest this please




[GitHub] [beam] github-actions[bot] commented on pull request #27534: [Java][Schemas] Improve performance of GetterBasedSchemaProvider#fromRowFunction

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #27534:
URL: https://github.com/apache/beam/pull/27534#issuecomment-1665514258

   Assigning new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment `assign to next reviewer`:
   
   R: @kennknowles for label java.
   
   Available commands:
   - `stop reviewer notifications` - opt out of the automated review tooling
   - `remind me after tests pass` - tag the comment author after tests pass
   - `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)




[GitHub] [beam] mosche merged pull request #27534: [Java][Schemas] Improve performance of GetterBasedSchemaProvider#fromRowFunction

Posted by "mosche (via GitHub)" <gi...@apache.org>.
mosche merged PR #27534:
URL: https://github.com/apache/beam/pull/27534




[GitHub] [beam] github-actions[bot] commented on pull request #27534: [Java][Schemas] Improve performance of GetterBasedSchemaProvider#fromRowFunction

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #27534:
URL: https://github.com/apache/beam/pull/27534#issuecomment-1662101054

   Reminder, please take a look at this pr: @bvolpato 

