Posted to github@beam.apache.org by "robertwb (via GitHub)" <gi...@apache.org> on 2023/09/15 20:39:43 UTC

[GitHub] [beam] robertwb opened a new pull request, #28486: Add schema-aware text file reading and writing.

robertwb opened a new pull request, #28486:
URL: https://github.com/apache/beam/pull/28486

   **Please** add a meaningful description for your change here
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] codecov[bot] commented on pull request #28486: Add schema-aware text file reading and writing.

Posted by "codecov[bot] (via GitHub)" <gi...@apache.org>.
codecov[bot] commented on PR #28486:
URL: https://github.com/apache/beam/pull/28486#issuecomment-1721947209

   ## [Codecov](https://app.codecov.io/gh/apache/beam/pull/28486?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report
   > Merging [#28486](https://app.codecov.io/gh/apache/beam/pull/28486?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) (ee4278f) into [master](https://app.codecov.io/gh/apache/beam/commit/1f980eaa894cc43ea5ca1aeb4cb2ef1de1162b17?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) (1f980ea) will **increase** coverage by `0.00%`.
   > Report is 1 commits behind head on master.
   > The diff coverage is `35.71%`.
   
   ```diff
   @@           Coverage Diff           @@
   ##           master   #28486   +/-   ##
   =======================================
     Coverage   72.22%   72.22%           
   =======================================
     Files         684      684           
     Lines      100853   100868   +15     
   =======================================
   + Hits        72840    72854   +14     
   - Misses      26436    26437    +1     
     Partials     1577     1577           
   ```
   
   | [Flag](https://app.codecov.io/gh/apache/beam/pull/28486/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | |
   |---|---|---|
   | [python](https://app.codecov.io/gh/apache/beam/pull/28486/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `82.82% <35.71%> (+<0.01%)` | :arrow_up: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Files Changed](https://app.codecov.io/gh/apache/beam/pull/28486?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | |
   |---|---|---|
   | [sdks/python/apache\_beam/yaml/yaml\_io.py](https://app.codecov.io/gh/apache/beam/pull/28486?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0veWFtbC95YW1sX2lvLnB5) | `40.00% <35.71%> (ø)` | |
   
   ... and [4 files with indirect coverage changes](https://app.codecov.io/gh/apache/beam/pull/28486/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
   
   




[GitHub] [beam] robertwb commented on a diff in pull request #28486: Add schema-aware text file reading and writing.

Posted by "robertwb (via GitHub)" <gi...@apache.org>.
robertwb commented on code in PR #28486:
URL: https://github.com/apache/beam/pull/28486#discussion_r1330727380


##########
sdks/python/apache_beam/yaml/yaml_io.py:
##########
@@ -28,12 +28,38 @@
 import yaml
 
 import apache_beam as beam
+import apache_beam.io as beam_io
 from apache_beam.io import ReadFromBigQuery
 from apache_beam.io import WriteToBigQuery
 from apache_beam.io.gcp.bigquery import BigQueryDisposition
+from apache_beam.typehints.schemas import named_fields_from_element_type
 from apache_beam.yaml import yaml_provider
 
 
+def read_from_text(path: str):
+  # TODO(yaml): Consider passing the filename and offset, possibly even
+  # by default.
+  return beam_io.ReadFromText(path) | beam.Map(lambda s: beam.Row(line=s))
+
+
+@beam.ptransform_fn
+def write_to_text(pcoll, path: str):
+  try:
+    field_names = [
+        name for name, _ in named_fields_from_element_type(pcoll.element_type)
+    ]
+  except Exception as exn:
+    raise ValueError(
+        "WriteToText requires an input schema with exactly one field.") from exn
+  if len(field_names) != 1:
+    raise ValueError(
+        "WriteToText requires an input schema with exactly one field, got %s" %
+        field_names)
+  sole_field_name, = field_names
+  return pcoll | beam.Map(
+      lambda x: str(getattr(x, sole_field_name))) | beam.io.WriteToText(path)

Review Comment:
   One may want to specify a general suffix, not just an extension, and perhaps other sharding parameters (like the shard format). I think we'll want to add this in a consistent way to all file output types. I'm not confident enough about what that will look like to get something in right now, though, and it is an additive change. (I don't think it should be required, but I could perhaps see using `.txt` as a default and having to override it with an empty string to get no suffix. TBD)
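
For illustration only (not part of this PR): a minimal sketch of how a shard format and a general suffix combine into output file names. It assumes the default shard template `-SSSSS-of-NNNNN` used by Beam's Python file sinks, which `beam.io.WriteToText` exposes as `shard_name_template` alongside `file_name_suffix`. The helper `expand_shard_name` is hypothetical, and its regex is a simplification of the real template handling.

```python
import re

# Assumed default shard template for Beam's Python file sinks:
# a run of S is the shard index, a run of N is the total shard count.
DEFAULT_SHARD_TEMPLATE = "-SSSSS-of-NNNNN"


def expand_shard_name(path, shard, num_shards, suffix="",
                      template=DEFAULT_SHARD_TEMPLATE):
    """Hypothetical helper: build one output file name.

    The result is <path><expanded template><suffix>, where each run of
    S or N in the template is replaced by a zero-padded number of the
    same width.
    """
    def fill(match):
        run = match.group(0)
        value = shard if run[0] == "S" else num_shards
        return str(value).zfill(len(run))

    return path + re.sub(r"S+|N+", fill, template) + suffix


if __name__ == "__main__":
    # No suffix: the naming this PR produces today.
    print(expand_shard_name("out", 0, 3))
    # An extension-style suffix.
    print(expand_shard_name("out", 0, 3, suffix=".txt"))
    # A general suffix together with a custom shard format.
    print(expand_shard_name("out", 1, 3, suffix="-v2.csv", template="-SS"))
```

Treating the suffix as an arbitrary string (rather than prepending a dot to an extension) is what lets an empty string mean "no suffix", which matters if `.txt` ever becomes the default.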





[GitHub] [beam] damccorm commented on a diff in pull request #28486: Add schema-aware text file reading and writing.

Posted by "damccorm (via GitHub)" <gi...@apache.org>.
damccorm commented on code in PR #28486:
URL: https://github.com/apache/beam/pull/28486#discussion_r1331651431


##########
sdks/python/apache_beam/yaml/yaml_io.py:
##########
@@ -28,12 +28,38 @@
 import yaml
 
 import apache_beam as beam
+import apache_beam.io as beam_io
 from apache_beam.io import ReadFromBigQuery
 from apache_beam.io import WriteToBigQuery
 from apache_beam.io.gcp.bigquery import BigQueryDisposition
+from apache_beam.typehints.schemas import named_fields_from_element_type
 from apache_beam.yaml import yaml_provider
 
 
+def read_from_text(path: str):
+  # TODO(yaml): Consider passing the filename and offset, possibly even
+  # by default.
+  return beam_io.ReadFromText(path) | beam.Map(lambda s: beam.Row(line=s))
+
+
+@beam.ptransform_fn
+def write_to_text(pcoll, path: str):
+  try:
+    field_names = [
+        name for name, _ in named_fields_from_element_type(pcoll.element_type)
+    ]
+  except Exception as exn:
+    raise ValueError(
+        "WriteToText requires an input schema with exactly one field.") from exn
+  if len(field_names) != 1:
+    raise ValueError(
+        "WriteToText requires an input schema with exactly one field, got %s" %
+        field_names)
+  sole_field_name, = field_names
+  return pcoll | beam.Map(
+      lambda x: str(getattr(x, sole_field_name))) | beam.io.WriteToText(path)

Review Comment:
   > but perhaps could see using .txt as a default
   
   That is still technically breaking, fwiw (though I think it's fine to do at this stage)
   
   > One may want to be able to specify a general suffix, not just an extension, and maybe other sharding parameters (like the shard format). I think we'll want to add this in a consistent way to all file output types. I'm not confident enough as to what that'll look like to get something in right now though, and it is something additive.
   
   I generally agree, though I think I am very confident we will want to allow folks to specify a suffix or extension (naming it suffix instead of extension is fine, though I think the latter is more intuitive for a potentially less technical audience).
   
   Regardless, I am ok leaving this for now, since I think getting something in before the cut is worthwhile.





[GitHub] [beam] robertwb commented on a diff in pull request #28486: Add schema-aware text file reading and writing.

Posted by "robertwb (via GitHub)" <gi...@apache.org>.
robertwb commented on code in PR #28486:
URL: https://github.com/apache/beam/pull/28486#discussion_r1330727638


##########
sdks/python/apache_beam/yaml/yaml_io.py:
##########
@@ -28,12 +28,38 @@
 import yaml
 
 import apache_beam as beam
+import apache_beam.io as beam_io
 from apache_beam.io import ReadFromBigQuery
 from apache_beam.io import WriteToBigQuery
 from apache_beam.io.gcp.bigquery import BigQueryDisposition
+from apache_beam.typehints.schemas import named_fields_from_element_type
 from apache_beam.yaml import yaml_provider
 
 
+def read_from_text(path: str):
+  # TODO(yaml): Consider passing the filename and offset, possibly even
+  # by default.

Review Comment:
   Yep, exactly. 





[GitHub] [beam] github-actions[bot] commented on pull request #28486: Add schema-aware text file reading and writing.

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #28486:
URL: https://github.com/apache/beam/pull/28486#issuecomment-1721844124

   Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control




[GitHub] [beam] damccorm commented on a diff in pull request #28486: Add schema-aware text file reading and writing.

Posted by "damccorm (via GitHub)" <gi...@apache.org>.
damccorm commented on code in PR #28486:
URL: https://github.com/apache/beam/pull/28486#discussion_r1330700743


##########
sdks/python/apache_beam/yaml/yaml_io.py:
##########
@@ -28,12 +28,38 @@
 import yaml
 
 import apache_beam as beam
+import apache_beam.io as beam_io
 from apache_beam.io import ReadFromBigQuery
 from apache_beam.io import WriteToBigQuery
 from apache_beam.io.gcp.bigquery import BigQueryDisposition
+from apache_beam.typehints.schemas import named_fields_from_element_type
 from apache_beam.yaml import yaml_provider
 
 
+def read_from_text(path: str):
+  # TODO(yaml): Consider passing the filename and offset, possibly even
+  # by default.

Review Comment:
   To be clear, you're saying pass them as fields in the returned `beam.Row`? I'm +1 on optionally doing that in the future, FWIW (it would maybe be generally useful for `ReadFromText`). In particular, getting the filename would likely be helpful in some use cases.
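
For illustration only (not part of this PR): a minimal pure-Python sketch of what "filename and offset as fields in the returned row" could look like. `LineRow` is a hypothetical stand-in for a `beam.Row`, and `offset` is assumed here to mean the byte offset of the start of each line; the TODO does not say whether a byte offset or a line number is intended.

```python
from typing import Iterator, NamedTuple


class LineRow(NamedTuple):
    """Hypothetical stand-in for a beam.Row carrying provenance fields."""
    line: str
    filename: str
    offset: int  # Assumed meaning: byte offset of the start of the line.


def read_lines_with_provenance(path: str) -> Iterator[LineRow]:
    """Yield one row per line, tagged with its source file and byte offset."""
    offset = 0
    # Read in binary mode so that offsets count bytes, not decoded characters.
    with open(path, "rb") as f:
        for raw in f:
            text = raw.rstrip(b"\r\n").decode("utf-8")
            yield LineRow(line=text, filename=path, offset=offset)
            offset += len(raw)
```

Making the extra fields opt-in, as suggested above, would keep the single-field `line` schema for pipelines that feed the rows straight back into a single-field text writer.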



##########
sdks/python/apache_beam/yaml/yaml_io.py:
##########
@@ -28,12 +28,38 @@
 import yaml
 
 import apache_beam as beam
+import apache_beam.io as beam_io
 from apache_beam.io import ReadFromBigQuery
 from apache_beam.io import WriteToBigQuery
 from apache_beam.io.gcp.bigquery import BigQueryDisposition
+from apache_beam.typehints.schemas import named_fields_from_element_type
 from apache_beam.yaml import yaml_provider
 
 
+def read_from_text(path: str):
+  # TODO(yaml): Consider passing the filename and offset, possibly even
+  # by default.
+  return beam_io.ReadFromText(path) | beam.Map(lambda s: beam.Row(line=s))
+
+
+@beam.ptransform_fn
+def write_to_text(pcoll, path: str):
+  try:
+    field_names = [
+        name for name, _ in named_fields_from_element_type(pcoll.element_type)
+    ]
+  except Exception as exn:
+    raise ValueError(
+        "WriteToText requires an input schema with exactly one field.") from exn
+  if len(field_names) != 1:
+    raise ValueError(
+        "WriteToText requires an input schema with exactly one field, got %s" %
+        field_names)
+  sole_field_name, = field_names
+  return pcoll | beam.Map(
+      lambda x: str(getattr(x, sole_field_name))) | beam.io.WriteToText(path)

Review Comment:
   Should we take a (required?) `file_extension` parameter as well? Right now, this will output files like: `<path>-<shard_id>`, but I'd guess most people will want `<path>-<shard_id>.<extension>`





[GitHub] [beam] damccorm merged pull request #28486: Add schema-aware text file reading and writing.

Posted by "damccorm (via GitHub)" <gi...@apache.org>.
damccorm merged PR #28486:
URL: https://github.com/apache/beam/pull/28486




[GitHub] [beam] robertwb commented on pull request #28486: Add schema-aware text file reading and writing.

Posted by "robertwb (via GitHub)" <gi...@apache.org>.
robertwb commented on PR #28486:
URL: https://github.com/apache/beam/pull/28486#issuecomment-1721842447

   R: @Polber

