You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/07/26 20:04:42 UTC

[GitHub] [beam] ahmedabu98 commented on a diff in pull request #22452: Update BigQuery URI validation to allow more valid URIs through

ahmedabu98 commented on code in PR #22452:
URL: https://github.com/apache/beam/pull/22452#discussion_r930360989


##########
sdks/python/apache_beam/io/gcp/bigquery_tools.py:
##########
@@ -101,6 +101,10 @@
 # Timeout for a BQ streaming insert RPC. Set to a maximum of 2 minutes.
 BQ_STREAMING_INSERT_TIMEOUT_SEC = 120
 
+_PROJECT_PATTERN = r'[a-z][a-z0-9-]{4,28}[a-z0-9]'

Review Comment:
   I suspect this is why precommits are failing: https://ci-beam.apache.org/job/beam_PreCommit_Python_Commit/24179/testReport/



##########
sdks/python/apache_beam/io/gcp/bigquery_tools.py:
##########
@@ -254,11 +258,11 @@ def parse_table_reference(table, dataset=None, project=None):
   # table argument will contain a full table reference instead of just a
   # table name.
   if dataset is None:
-    regex = re.compile(
-        r'''^((?P<project>.+):)?(?P<dataset>\w+)\.
-            (?P<table>[-\w\$]+(\s+\-*\w+)*)$''',
-        re.X)
-    match = regex.match(table)
+    pattern = (
+        f'(?P<project>{_PROJECT_PATTERN})'
+        f'[:\\.](?P<dataset>{_DATASET_PATTERN})'
+        f'[:\\.](?P<table>{_TABLE_PATTERN})')

Review Comment:
   Glad to see you're allowing for the dot in `project.dataset` too.
   
   Does BigQuery allow the colon operator in between `dataset` and `table`? AFAIK it should be dot operator.



##########
sdks/python/apache_beam/io/gcp/bigquery_tools.py:
##########
@@ -101,6 +101,10 @@
 # Timeout for a BQ streaming insert RPC. Set to a maximum of 2 minutes.
 BQ_STREAMING_INSERT_TIMEOUT_SEC = 120
 
+_PROJECT_PATTERN = r'[a-z][a-z0-9-]{4,28}[a-z0-9]'

Review Comment:
   Minimum project character requirement is 4, see [here](https://cloud.google.com/resource-manager/docs/creating-managing-projects#creating_a_project). As it is now the minimum is 6. Perhaps the lower end of the range could be 2.



##########
sdks/python/apache_beam/io/gcp/bigquery_tools.py:
##########
@@ -101,6 +101,10 @@
 # Timeout for a BQ streaming insert RPC. Set to a maximum of 2 minutes.
 BQ_STREAMING_INSERT_TIMEOUT_SEC = 120
 
+_PROJECT_PATTERN = r'[a-z][a-z0-9-]{4,28}[a-z0-9]'
+_DATASET_PATTERN = r'\w{1,1024}'
+_TABLE_PATTERN = r'[\p{L}\p{M}\p{N}\p{Pc}\p{Pd}\p{Zs}]{1,1024}'

Review Comment:
   Does this include the partition character `$`? This is needed when user is specifying a table partition, see https://cloud.google.com/bigquery/docs/managing-partitioned-table-data#partition_decorators.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org