You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/03/22 19:21:26 UTC

[GitHub] [beam] chunyang opened a new pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

chunyang opened a new pull request #17153:
URL: https://github.com/apache/beam/pull/17153


   Second attempt to avoid storing generator in _CustomBigQuerySource. See description from last attempt (#17100) below.
   
   This time, export table schema is retrieved before the temp dataset is deleted. Because TableSchema is not picklable (see https://github.com/apache/beam/pull/17100#issuecomment-1075505398), I convert it into a JSON string before storing it as an instance variable.
   
   ------------------------
   Avoid storing a generator in _CustomBigQuerySource so that it can be used with the Python interactive runner. See details in https://issues.apache.org/jira/browse/BEAM-14112 .
   
   Likely related to https://github.com/apache/beam/pull/15610
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`).
    - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
    - [ ] Update `CHANGES.md` with noteworthy changes.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   
   To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)
   
   GitHub Actions Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
   [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
   
   See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chunyang commented on pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
chunyang commented on pull request #17153:
URL: https://github.com/apache/beam/pull/17153#issuecomment-1075543224


   R: @KevinGG 
   R: @chamikaramj 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] KevinGG commented on pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
KevinGG commented on pull request #17153:
URL: https://github.com/apache/beam/pull/17153#issuecomment-1076678937


   
   > Just to explore some other options other than converting schema to JSON, prior to #15610, there was no generator or TableSchema to cause pickling errors.
   > 
   > Instead of storing the generator as an instance attribute, a list of [TextSource](https://github.com/apache/beam/blob/v2.34.0/sdks/python/apache_beam/io/gcp/bigquery.py#L793-L798) and notably its `coder` attribute were stored instead (we can assume `use_json_exports=True` for this discussion). The default `coder`, `_JsonToDictCoder`, had a method [`_convert_to_tuple`](https://github.com/apache/beam/blob/v2.34.0/sdks/python/apache_beam/io/gcp/bigquery_read_internal.py#L401-L413) to marshal the TableSchema into an object more amenable to pickling.
   > 
   > Perhaps I can use the same `_convert_to_tuple` method to create a picklable version of TableSchema and store that as an attribute rather than going the JSON route?
   
   How about we store the coders directly in the `_BigQueryExportResult` class? Since we don't really need the TableSchema later but a coder built from the TableSchema.
   I already tested it out that the coder itself is pickle-able.
   
   `_JsonToDictCoder` is the default `self.coder` used to create the coder from a TableSchma, but not necessarily the only possible implementation.
   
   So your `_BigQueryExportResult` class could be:
   
   ```python
   @dataclass
   class _BigQueryExportResult:
      coder: beam.coders.Coder
      paths: List[str]
   ```
   
   And
   
   ```python
   export_result = __BigQueryExportResult(coder=self.coder(table_schema), paths=[metadata.path for metadata in metadata_list])
   ```
      
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chunyang commented on pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
chunyang commented on pull request #17153:
URL: https://github.com/apache/beam/pull/17153#issuecomment-1075539669


   Run Python 3.8 PostCommit


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] KevinGG commented on pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
KevinGG commented on pull request #17153:
URL: https://github.com/apache/beam/pull/17153#issuecomment-1076636539


   > > Could you plz share information about the failure?
   > 
   > I don't know much other than what the [console output](https://ci-beam.apache.org/job/beam_PostCommit_Python38_PR/428/console) is showing. Something about failing to load/store cache entries.
   > 
   > ```
   > 14:29:57 [...]
   > 14:29:57 FAILURE: Build completed with 2 failures.
   > 14:29:57 
   > 14:29:57 1: Task failed with an exception.
   > 14:29:57 -----------
   > 14:29:57 * What went wrong:
   > 14:29:57 Execution failed for task ':runners:flink:1.14:job-server:shadowJar'.
   > 14:29:57 > Failed to load cache entry for task ':runners:flink:1.14:job-server:shadowJar'
   > 14:29:57 [...]
   > 14:29:57 2: Task failed with an exception.
   > 14:29:57 -----------
   > 14:29:57 * What went wrong:
   > 14:29:57 Execution failed for task ':sdks:java:io:google-cloud-platform:expansion-service:shadowJar'.
   > 14:29:57 > Failed to store cache entry for task ':sdks:java:io:google-cloud-platform:expansion-service:shadowJar'
   > 14:29:57 [...]
   > ```
   
   I doubt this Python code change would affect the target :runners:flink:1.14:job-server:shadowJar, should be unrelated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chamikaramj commented on a change in pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
chamikaramj commented on a change in pull request #17153:
URL: https://github.com/apache/beam/pull/17153#discussion_r832766516



##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -657,6 +658,22 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)
 
 
+class _BigQueryExportResult:
+  def __init__(self, schema, metadata_list):
+    self.schema = schema
+    self.metadata_list = metadata_list
+
+  # Store schema as JSON string since TableSchema objects cannot be pickled.
+  @property
+  @functools.lru_cache()
+  def schema(self):
+    return bigquery_tools.parse_table_schema_from_json(self._schema)
+
+  @schema.setter
+  def schema(self, value):
+    self._schema = json.dumps(bigquery_tools.table_schema_to_dict(value))

Review comment:
       bigquery_tools.parse_table_schema_from_json is just an implementation detail of BQ connector and can change in the future so I would not assume that. "json.dumps" and "json.loads" are guaranteed to be reciprocal so I would use that instead.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chunyang commented on pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
chunyang commented on pull request #17153:
URL: https://github.com/apache/beam/pull/17153#issuecomment-1075676149


   Python integration tests passed in postcommit AFAICT, but there was a failure in `:sdks:java:io:google-cloud-platform:expansion-service:shadowJar`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chunyang commented on pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
chunyang commented on pull request #17153:
URL: https://github.com/apache/beam/pull/17153#issuecomment-1075809762


   > Could you plz share information about the failure?
   
   I don't know much other than what the [console output](https://ci-beam.apache.org/job/beam_PostCommit_Python38_PR/428/console) is showing. Something about failing to load/store cache entries.
   ```
   14:29:57 [...]
   14:29:57 FAILURE: Build completed with 2 failures.
   14:29:57 
   14:29:57 1: Task failed with an exception.
   14:29:57 -----------
   14:29:57 * What went wrong:
   14:29:57 Execution failed for task ':runners:flink:1.14:job-server:shadowJar'.
   14:29:57 > Failed to load cache entry for task ':runners:flink:1.14:job-server:shadowJar'
   14:29:57 [...]
   14:29:57 2: Task failed with an exception.
   14:29:57 -----------
   14:29:57 * What went wrong:
   14:29:57 Execution failed for task ':sdks:java:io:google-cloud-platform:expansion-service:shadowJar'.
   14:29:57 > Failed to store cache entry for task ':sdks:java:io:google-cloud-platform:expansion-service:shadowJar'
   14:29:57 [...]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chamikaramj commented on a change in pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
chamikaramj commented on a change in pull request #17153:
URL: https://github.com/apache/beam/pull/17153#discussion_r833586185



##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -657,6 +658,22 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)
 
 
+class _BigQueryExportResult:
+  def __init__(self, schema, metadata_list):
+    self.schema = schema
+    self.metadata_list = metadata_list
+
+  # Store schema as JSON string since TableSchema objects cannot be pickled.
+  @property
+  @functools.lru_cache()
+  def schema(self):
+    return bigquery_tools.parse_table_schema_from_json(self._schema)
+
+  @schema.setter
+  def schema(self, value):
+    self._schema = json.dumps(bigquery_tools.table_schema_to_dict(value))

Review comment:
       Seems like the types are also different ? Input is a bigquery.TableSchema object and output is JSON dict. I would implement new functions here without depending on existing functions and add extensive testing to make sure that this is valid/correct for various bigquery.TableSchema objects.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chunyang commented on a change in pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
chunyang commented on a change in pull request #17153:
URL: https://github.com/apache/beam/pull/17153#discussion_r832793216



##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -657,6 +658,22 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)
 
 
+class _BigQueryExportResult:
+  def __init__(self, schema, metadata_list):
+    self.schema = schema
+    self.metadata_list = metadata_list
+
+  # Store schema as JSON string since TableSchema objects cannot be pickled.
+  @property
+  @functools.lru_cache()
+  def schema(self):
+    return bigquery_tools.parse_table_schema_from_json(self._schema)
+
+  @schema.setter
+  def schema(self, value):
+    self._schema = json.dumps(bigquery_tools.table_schema_to_dict(value))

Review comment:
       I think `json.dumps`/`json.loads` can't be used directly since TableSchema is not a `dict`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chunyang commented on a change in pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
chunyang commented on a change in pull request #17153:
URL: https://github.com/apache/beam/pull/17153#discussion_r832546377



##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -657,6 +658,22 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)
 
 
+class _BigQueryExportResult:
+  def __init__(self, schema, metadata_list):
+    self.schema = schema
+    self.metadata_list = metadata_list
+
+  # Store schema as JSON string since TableSchema objects cannot be pickled.
+  @property
+  @functools.lru_cache()
+  def schema(self):
+    return bigquery_tools.parse_table_schema_from_json(self._schema)
+
+  @schema.setter
+  def schema(self, value):
+    self._schema = json.dumps(bigquery_tools.table_schema_to_dict(value))

Review comment:
       Are there any guarantees that `json.dumps(bigquery_tools.table_schema_to_dict(x))` and `bigquery_tools.parse_table_schema_from_json(x)` are reciprocal functions?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] KevinGG commented on a change in pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
KevinGG commented on a change in pull request #17153:
URL: https://github.com/apache/beam/pull/17153#discussion_r833606434



##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -657,6 +658,22 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)
 
 
+class _BigQueryExportResult:
+  def __init__(self, schema, metadata_list):
+    self.schema = schema
+    self.metadata_list = metadata_list
+
+  # Store schema as JSON string since TableSchema objects cannot be pickled.
+  @property
+  @functools.lru_cache()
+  def schema(self):
+    return bigquery_tools.parse_table_schema_from_json(self._schema)
+
+  @schema.setter
+  def schema(self, value):
+    self._schema = json.dumps(bigquery_tools.table_schema_to_dict(value))

Review comment:
       +1 to Cham that the behavior of pickling a PTransform is not guaranteed. The workaround is limited by the current InteractiveRunner implementation.
   
   @chunyang I left a comment in the PR that explores saving the coder and paths as attributes instead of the TableSchema and matadata_list. Could you please check if that is an appropriate approach?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] codecov[bot] commented on pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
codecov[bot] commented on pull request #17153:
URL: https://github.com/apache/beam/pull/17153#issuecomment-1075563941


   # [Codecov](https://codecov.io/gh/apache/beam/pull/17153?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#17153](https://codecov.io/gh/apache/beam/pull/17153?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (cd6fea8) into [master](https://codecov.io/gh/apache/beam/commit/9a36490de8129359106baf37e1d0b071e19e303a?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9a36490) will **increase** coverage by `0.00%`.
   > The diff coverage is `52.94%`.
   
   ```diff
   @@           Coverage Diff           @@
   ##           master   #17153   +/-   ##
   =======================================
     Coverage   73.95%   73.95%           
   =======================================
     Files         671      671           
     Lines       88245    88257   +12     
   =======================================
   + Hits        65264    65273    +9     
   - Misses      21869    21872    +3     
     Partials     1112     1112           
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | python | `83.63% <52.94%> (-0.01%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/beam/pull/17153?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [sdks/python/apache\_beam/io/gcp/bigquery.py](https://codecov.io/gh/apache/beam/pull/17153/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vaW8vZ2NwL2JpZ3F1ZXJ5LnB5) | `63.67% <52.94%> (+0.04%)` | :arrow_up: |
   | [...hon/apache\_beam/runners/worker/bundle\_processor.py](https://codecov.io/gh/apache/beam/pull/17153/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy93b3JrZXIvYnVuZGxlX3Byb2Nlc3Nvci5weQ==) | `93.39% <0.00%> (ø)` | |
   | [...che\_beam/runners/interactive/interactive\_runner.py](https://codecov.io/gh/apache/beam/pull/17153/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy9pbnRlcmFjdGl2ZS9pbnRlcmFjdGl2ZV9ydW5uZXIucHk=) | `93.57% <0.00%> (+0.71%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/beam/pull/17153?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/beam/pull/17153?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [9a36490...cd6fea8](https://codecov.io/gh/apache/beam/pull/17153?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chamikaramj commented on a change in pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
chamikaramj commented on a change in pull request #17153:
URL: https://github.com/apache/beam/pull/17153#discussion_r833598733



##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -657,6 +658,22 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)
 
 
+class _BigQueryExportResult:
+  def __init__(self, schema, metadata_list):
+    self.schema = schema
+    self.metadata_list = metadata_list
+
+  # Store schema as JSON string since TableSchema objects cannot be pickled.
+  @property
+  @functools.lru_cache()
+  def schema(self):
+    return bigquery_tools.parse_table_schema_from_json(self._schema)
+
+  @schema.setter
+  def schema(self, value):
+    self._schema = json.dumps(bigquery_tools.table_schema_to_dict(value))

Review comment:
       As a higher level point though. I think all state of a source object being packable (including temporary state) is not a guarantee we provide and it will be very difficult to guarantee this behavior in the future for BQ and other sources. Also that can make source implementations more restrictive.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chunyang commented on a change in pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
chunyang commented on a change in pull request #17153:
URL: https://github.com/apache/beam/pull/17153#discussion_r833591760



##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -657,6 +658,22 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)
 
 
+class _BigQueryExportResult:
+  def __init__(self, schema, metadata_list):
+    self.schema = schema
+    self.metadata_list = metadata_list
+
+  # Store schema as JSON string since TableSchema objects cannot be pickled.
+  @property
+  @functools.lru_cache()
+  def schema(self):
+    return bigquery_tools.parse_table_schema_from_json(self._schema)
+
+  @schema.setter
+  def schema(self, value):
+    self._schema = json.dumps(bigquery_tools.table_schema_to_dict(value))

Review comment:
       Yeah, that sounds like the right approach. I will likely try to extract [`_convert_to_tuple`](https://github.com/apache/beam/blob/v2.34.0/sdks/python/apache_beam/io/gcp/bigquery_read_internal.py#L401-L413) from `_JsonToDictCoder` and build on top of that. I won't be able to get to it before the release cut today though.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chunyang commented on pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
chunyang commented on pull request #17153:
URL: https://github.com/apache/beam/pull/17153#issuecomment-1075539872


   Run PythonLint PreCommit


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] KevinGG commented on pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
KevinGG commented on pull request #17153:
URL: https://github.com/apache/beam/pull/17153#issuecomment-1075772075


   > Python integration tests passed in postcommit AFAICT, but there was a failure in `:sdks:java:io:google-cloud-platform:expansion-service:shadowJar`.
   
   Could you plz share information about the failure? The PR looks nice, IIUIC: similar to the reverted one but export the files before the temp dataset gets cleaned up. The reason to add the intermediate JSON conversion is because InteractiveRunner needs to pickle the PTransform while the table schema nor a generator are pickle-able as an attribute.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chunyang commented on pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
chunyang commented on pull request #17153:
URL: https://github.com/apache/beam/pull/17153#issuecomment-1075818775


   Just to explore some other options other than converting schema to JSON, prior to #15610, there was no generator or TableSchema to cause pickling errors.
   
   Instead of storing the generator as an instance attribute, a list of [TextSource](https://github.com/apache/beam/blob/v2.34.0/sdks/python/apache_beam/io/gcp/bigquery.py#L793-L798) and notably its `coder` attribute were stored instead (we can assume `use_json_exports=True` for this discussion). The default `coder`, `_JsonToDictCoder`, had a method [`_convert_to_tuple`](https://github.com/apache/beam/blob/v2.34.0/sdks/python/apache_beam/io/gcp/bigquery_read_internal.py#L401-L413) to marshal the TableSchema into an object more amenable to pickling.
   
   Perhaps I can use the same `_convert_to_tuple` method to create a picklable version of TableSchema and store that as an attribute rather than going the JSON route?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] KevinGG commented on a change in pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
KevinGG commented on a change in pull request #17153:
URL: https://github.com/apache/beam/pull/17153#discussion_r832685061



##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -657,6 +658,22 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)
 
 
+class _BigQueryExportResult:
+  def __init__(self, schema, metadata_list):
+    self.schema = schema
+    self.metadata_list = metadata_list
+
+  # Store schema as JSON string since TableSchema objects cannot be pickled.
+  @property
+  @functools.lru_cache()
+  def schema(self):
+    return bigquery_tools.parse_table_schema_from_json(self._schema)
+
+  @schema.setter
+  def schema(self, value):
+    self._schema = json.dumps(bigquery_tools.table_schema_to_dict(value))

Review comment:
       The implementation is around https://cloud.google.com/bigquery/docs/schemas#creating_a_json_schema_file, so they do seem to be reciprocal functions. Need @chamikaramj to double check this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] codecov[bot] edited a comment on pull request #17153: [BEAM-14112] Avoid storing a generator in _CustomBigQuerySource

Posted by GitBox <gi...@apache.org>.
codecov[bot] edited a comment on pull request #17153:
URL: https://github.com/apache/beam/pull/17153#issuecomment-1075563941


   # [Codecov](https://codecov.io/gh/apache/beam/pull/17153?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#17153](https://codecov.io/gh/apache/beam/pull/17153?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (cd6fea8) into [master](https://codecov.io/gh/apache/beam/commit/9a36490de8129359106baf37e1d0b071e19e303a?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9a36490) will **increase** coverage by `0.00%`.
   > The diff coverage is `52.94%`.
   
   ```diff
   @@           Coverage Diff           @@
   ##           master   #17153   +/-   ##
   =======================================
     Coverage   73.95%   73.95%           
   =======================================
     Files         671      671           
     Lines       88245    88257   +12     
   =======================================
   + Hits        65264    65273    +9     
   - Misses      21869    21872    +3     
     Partials     1112     1112           
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | python | `83.63% <52.94%> (-0.01%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/beam/pull/17153?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [sdks/python/apache\_beam/io/gcp/bigquery.py](https://codecov.io/gh/apache/beam/pull/17153/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vaW8vZ2NwL2JpZ3F1ZXJ5LnB5) | `63.67% <52.94%> (+0.04%)` | :arrow_up: |
   | [...hon/apache\_beam/runners/worker/bundle\_processor.py](https://codecov.io/gh/apache/beam/pull/17153/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy93b3JrZXIvYnVuZGxlX3Byb2Nlc3Nvci5weQ==) | `93.39% <0.00%> (ø)` | |
   | [...che\_beam/runners/interactive/interactive\_runner.py](https://codecov.io/gh/apache/beam/pull/17153/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy9pbnRlcmFjdGl2ZS9pbnRlcmFjdGl2ZV9ydW5uZXIucHk=) | `93.57% <0.00%> (+0.71%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/beam/pull/17153?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/beam/pull/17153?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [9a36490...cd6fea8](https://codecov.io/gh/apache/beam/pull/17153?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org