You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/08/29 21:36:25 UTC

[GitHub] [beam] TheNeuralBit opened a new pull request, #22947: WIP: Add `arrow_type_compatibility` with `pyarrow.Table` `BatchConverter` implementation

TheNeuralBit opened a new pull request, #22947:
URL: https://github.com/apache/beam/pull/22947

   GitHub Actions Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
   [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Go tests](https://github.com/apache/beam/workflows/Go%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)
   
   See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] github-actions[bot] closed pull request #22947: WIP: Use `arrow_type_compatibility` in ParquetIO

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #22947: WIP: Use `arrow_type_compatibility` in ParquetIO
URL: https://github.com/apache/beam/pull/22947


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] TheNeuralBit commented on a diff in pull request #22947: WIP: Add `arrow_type_compatibility` with `pyarrow.Table` `BatchConverter` implementation

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on code in PR #22947:
URL: https://github.com/apache/beam/pull/22947#discussion_r960088478


##########
sdks/python/apache_beam/typehints/arrow_type_compatibility.py:
##########
@@ -0,0 +1,271 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Utilities for converting between Beam and Arrow schemas.
+
+For internal use only, no backward compatibility guarantees.
+"""
+
+from typing import Dict
+from typing import List
+from typing import Optional
+from typing import Sequence
+from typing import Tuple
+
+import pyarrow as pa
+
+from apache_beam.portability.api import schema_pb2
+from apache_beam.typehints.batch import BatchConverter
+from apache_beam.typehints.row_type import RowTypeConstraint
+from apache_beam.typehints.schemas import typing_from_runner_api
+from apache_beam.typehints.schemas import typing_to_runner_api
+from apache_beam.utils import proto_utils
+
+__all__ = []
+
+BEAM_SCHEMA_ID_KEY = b'beam:schema_id'
+BEAM_OPTION_KEY_PREFIX = b'beam:option:'
+
+
+def _hydrate_beam_option(encoded_option: bytes) -> schema_pb2.Option:
+  return proto_utils.parse_Bytes(encoded_option, schema_pb2.Option)
+
+
+def beam_schema_from_arrow_schema(arrow_schema: pa.Schema) -> schema_pb2.Schema:
+  if arrow_schema.metadata:
+    schema_id = arrow_schema.metadata.get(BEAM_SCHEMA_ID_KEY, None)
+    schema_options = [
+        _hydrate_beam_option(value) for key,
+        value in arrow_schema.metadata.items()
+        if key.startswith(BEAM_OPTION_KEY_PREFIX)
+    ]
+  else:
+    schema_id = None
+    schema_options = []
+
+  return schema_pb2.Schema(
+      fields=[
+          _beam_field_from_arrow_field(arrow_schema.field(i))
+          for i in range(len(arrow_schema.types))
+      ],
+      options=schema_options,
+      id=schema_id)
+
+
+def _beam_field_from_arrow_field(arrow_field: pa.Field) -> schema_pb2.Field:
+  beam_fieldtype = _beam_fieldtype_from_arrow_field(arrow_field)
+
+  if arrow_field.metadata:
+    field_options = [
+        _hydrate_beam_option(value) for key,
+        value in arrow_field.metadata.items()
+        if key.startswith(BEAM_OPTION_KEY_PREFIX)
+    ]
+  else:
+    field_options = None
+
+  return schema_pb2.Field(
+      name=arrow_field.name,
+      type=beam_fieldtype,
+      options=field_options,
+  )
+
+
+def _beam_fieldtype_from_arrow_field(
+    arrow_field: pa.Field) -> schema_pb2.FieldType:
+  beam_fieldtype = _beam_fieldtype_from_arrow_type(arrow_field.type)
+  beam_fieldtype.nullable = arrow_field.nullable
+
+  return beam_fieldtype
+
+
+def _beam_fieldtype_from_arrow_type(
+    arrow_type: pa.DataType) -> schema_pb2.FieldType:
+  if arrow_type in PYARROW_TO_ATOMIC_TYPE:
+    return schema_pb2.FieldType(atomic_type=PYARROW_TO_ATOMIC_TYPE[arrow_type])
+  elif isinstance(arrow_type, pa.ListType):
+    return schema_pb2.FieldType(
+        array_type=schema_pb2.ArrayType(
+            element_type=_beam_fieldtype_from_arrow_field(
+                arrow_type.value_field)))
+  elif isinstance(arrow_type, pa.MapType):
+    return schema_pb2.FieldType(
+        map_type=schema_pb2.MapType(
+            key_type=_beam_fieldtype_from_arrow_field(arrow_type.key_field),
+            value_type=_beam_fieldtype_from_arrow_field(arrow_type.item_field)))
+  elif isinstance(arrow_type, pa.StructType):
+    # TODO
+    pass
+  else:
+    raise ValueError(f"Unrecognized arrow type: {arrow_type!r}")
+
+
+def _option_as_arrow_metadata(
+    beam_option: schema_pb2.Option) -> Tuple[bytes, bytes]:
+  return (
+      BEAM_OPTION_KEY_PREFIX + beam_option.name.encode('UTF-8'),
+      beam_option.SerializeToString())
+
+
+def arrow_schema_from_beam_schema(beam_schema: schema_pb2.Schema) -> pa.Schema:
+  return pa.schema(
+      [_arrow_field_from_beam_field(field) for field in beam_schema.fields],
+      {
+          BEAM_SCHEMA_ID_KEY: beam_schema.id,
+          **dict(
+              _option_as_arrow_metadata(option) for option in beam_schema.options)  # pylint: disable=line-too-long

Review Comment:
   Noting here that pylint and yapf disagree on this line. I should file an issue tracking this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] github-actions[bot] commented on pull request #22947: WIP: Use `arrow_type_compatibility` in ParquetIO

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #22947:
URL: https://github.com/apache/beam/pull/22947#issuecomment-1494257906

   This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] github-actions[bot] commented on pull request #22947: WIP: Use `arrow_type_compatibility` in ParquetIO

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #22947:
URL: https://github.com/apache/beam/pull/22947#issuecomment-1484085766

   This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] codecov[bot] commented on pull request #22947: WIP: Add `arrow_type_compatibility` with `pyarrow.Table` `BatchConverter` implementation

Posted by GitBox <gi...@apache.org>.
codecov[bot] commented on PR #22947:
URL: https://github.com/apache/beam/pull/22947#issuecomment-1233260227

   # [Codecov](https://codecov.io/gh/apache/beam/pull/22947?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#22947](https://codecov.io/gh/apache/beam/pull/22947?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (7ed5e7e) into [master](https://codecov.io/gh/apache/beam/commit/dbc6a466660283d2227162218f258b7a82ceca20?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (dbc6a46) will **increase** coverage by `0.00%`.
   > The diff coverage is `74.56%`.
   
   ```diff
   @@           Coverage Diff            @@
   ##           master   #22947    +/-   ##
   ========================================
     Coverage   73.66%   73.67%            
   ========================================
     Files         713      715     +2     
     Lines       94970    95166   +196     
   ========================================
   + Hits        69964    70116   +152     
   - Misses      23705    23749    +44     
     Partials     1301     1301            
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | python | `83.49% <74.56%> (-0.02%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/beam/pull/22947?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [.../apache\_beam/typehints/arrow\_batching\_benchmark.py](https://codecov.io/gh/apache/beam/pull/22947/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vdHlwZWhpbnRzL2Fycm93X2JhdGNoaW5nX2JlbmNobWFyay5weQ==) | `0.00% <0.00%> (ø)` | |
   | [sdks/python/apache\_beam/typehints/\_\_init\_\_.py](https://codecov.io/gh/apache/beam/pull/22947/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vdHlwZWhpbnRzL19faW5pdF9fLnB5) | `77.77% <66.66%> (-22.23%)` | :arrow_down: |
   | [.../apache\_beam/typehints/arrow\_type\_compatibility.py](https://codecov.io/gh/apache/beam/pull/22947/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vdHlwZWhpbnRzL2Fycm93X3R5cGVfY29tcGF0aWJpbGl0eS5weQ==) | `88.88% <88.88%> (ø)` | |
   | [sdks/python/apache\_beam/io/parquetio.py](https://codecov.io/gh/apache/beam/pull/22947/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vaW8vcGFycXVldGlvLnB5) | `95.34% <96.55%> (+0.18%)` | :arrow_up: |
   | [sdks/python/apache\_beam/typehints/batch.py](https://codecov.io/gh/apache/beam/pull/22947/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vdHlwZWhpbnRzL2JhdGNoLnB5) | `89.74% <100.00%> (+1.35%)` | :arrow_up: |
   | [.../python/apache\_beam/testing/test\_stream\_service.py](https://codecov.io/gh/apache/beam/pull/22947/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vdGVzdGluZy90ZXN0X3N0cmVhbV9zZXJ2aWNlLnB5) | `88.09% <0.00%> (-4.77%)` | :arrow_down: |
   | [...che\_beam/runners/interactive/interactive\_runner.py](https://codecov.io/gh/apache/beam/pull/22947/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy9pbnRlcmFjdGl2ZS9pbnRlcmFjdGl2ZV9ydW5uZXIucHk=) | `90.06% <0.00%> (-1.33%)` | :arrow_down: |
   | [sdks/python/apache\_beam/typehints/schemas.py](https://codecov.io/gh/apache/beam/pull/22947/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vdHlwZWhpbnRzL3NjaGVtYXMucHk=) | `93.84% <0.00%> (-0.31%)` | :arrow_down: |
   | [sdks/python/apache\_beam/portability/common\_urns.py](https://codecov.io/gh/apache/beam/pull/22947/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcG9ydGFiaWxpdHkvY29tbW9uX3VybnMucHk=) | `100.00% <0.00%> (ø)` | |
   | [...pi/org/apache/beam/model/pipeline/v1/schema\_pb2.py](https://codecov.io/gh/apache/beam/pull/22947/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcG9ydGFiaWxpdHkvYXBpL29yZy9hcGFjaGUvYmVhbS9tb2RlbC9waXBlbGluZS92MS9zY2hlbWFfcGIyLnB5) | `100.00% <0.00%> (ø)` | |
   | ... and [6 more](https://codecov.io/gh/apache/beam/pull/22947/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org