Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/04/18 18:22:48 UTC

[GitHub] [beam] johnjcasey commented on a diff in pull request #17380: [BEAM-14314] Add last_updated field in filesystem.FileMetaData

johnjcasey commented on code in PR #17380:
URL: https://github.com/apache/beam/pull/17380#discussion_r852311644


##########
sdks/python/apache_beam/io/azure/blobstorageio.py:
##########
@@ -372,29 +355,47 @@ def last_updated(self, path):
     Returns: last updated time of the Azure Blob Storage blob
     in seconds.
     """
-    container, blob = parse_azfs_path(path)
-    blob_to_check = self.client.get_blob_client(container, blob)
-    try:
-      properties = blob_to_check.get_blob_properties()
-    except ResourceNotFoundError as e:
-      message = e.reason
-      code = e.status_code
-      raise BlobStorageError(message, code)
+    return self._updated_to_seconds(self._blob_properties(path).last_modified)
 
-    datatime = properties.last_modified
-    return (
-        time.mktime(datatime.timetuple()) - time.timezone +
-        datatime.microsecond / 1000000.0)
-
-  @retry.with_exponential_backoff(
-      retry_filter=retry.retry_on_beam_io_error_filter)
   def checksum(self, path):
     """Looks up the checksum of an Azure Blob Storage blob.
 
     Args:
       path: Azure Blob Storage file path pattern in the form
             azfs://<storage-account>/<container>/[name].
     """
+    return self._blob_properties(path).properties.etag
+
+  def _vars(self, path):

Review Comment:
   Can we improve the name here too?
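
   For illustration, here is roughly what the same helper could look like under a more descriptive name, pieced together from the code removed above (the name is only a suggestion, not something in the PR as posted):

       def _get_blob_properties(self, path):
         """Returns the blob properties for the blob at the given azfs:// path."""
         container, blob = parse_azfs_path(path)
         blob_to_check = self.client.get_blob_client(container, blob)
         try:
           return blob_to_check.get_blob_properties()
         except ResourceNotFoundError as e:
           # Surface the Azure error as a BlobStorageError, as the removed code did.
           raise BlobStorageError(e.reason, e.status_code)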



##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -421,27 +421,44 @@ def __exit__(self, exception_type, exception_value, traceback):
 
 
 class FileMetadata(object):
-  """Metadata about a file path that is the output of FileSystem.match."""
-  def __init__(self, path, size_in_bytes):
+  """Metadata about a file path that is the output of FileSystem.match.
+
+  Fields:
+    path: [Required] file path.
+    size_in_bytes: [Required] file size in bytes.
+    last_updated_in_seconds: [Optional] last modified timestamp of the file,
+    or 0.0 if not specified.

Review Comment:
   Does a timestamp in seconds match what we have in Java? I would generally expect it to be millis since the epoch.
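
   If the Java side does use millis since the epoch (an assumption, not verified in this thread), keeping the Python field in seconds only costs a small conversion at the boundary, for example:

       import datetime

       def seconds_to_millis(seconds):
         """Converts float seconds since the Unix epoch to integer millis."""
         return int(round(seconds * 1000))

       # Example: an aware datetime converted to seconds, then to millis.
       dt = datetime.datetime(2022, 4, 18, tzinfo=datetime.timezone.utc)
       assert seconds_to_millis(dt.timestamp()) == 1650240000000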



##########
sdks/python/apache_beam/io/gcp/gcsio.py:
##########
@@ -532,39 +510,83 @@ def last_updated(self, path):
 
     Returns: last updated time of the GCS object in seconds.
     """
+    return self._updated_to_seconds(self._gcs_object(path).updated)
+
+  def _vars(self, path):

Review Comment:
   I may be missing context, but _vars doesn't tell me much about what this method does. Can we get a better name here?
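
   Side note on the conversion itself: the body of _updated_to_seconds is not in this hunk, but judging from the code removed in blobstorageio.py above, it presumably looks something like this sketch (relying on the module-level time import):

       @staticmethod
       def _updated_to_seconds(updated):
         """Converts a client-returned datetime to float seconds since the epoch."""
         return (
             time.mktime(updated.timetuple()) - time.timezone +
             updated.microsecond / 1000000.0)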



##########
sdks/python/apache_beam/io/aws/s3io.py:
##########
@@ -464,6 +459,47 @@ def exists(self, path):
         # We re-raise all other exceptions
         raise
 
+  def _vars(self, path):

Review Comment:
   same here



##########
sdks/python/apache_beam/io/hadoopfilesystem.py:
##########
@@ -399,6 +407,26 @@ def checksum(self, url):
         file_checksum[_FILE_CHECKSUM_BYTES],
     )
 
+  def metadata(self, url):
+    """Fetch metadata fields of a file on the FileSystem.
+
+    Args:
+      url: string url of a file.
+
+    Returns:
+      :class:`~apache_beam.io.filesystem.FileMetadata`.
+      Note: last_updated field is not supported yet.
+
+    Raises:
+      ``BeamIOError``: if url doesn't exist.
+    """
+    _, path = self._parse_url(url)
+    status = self._hdfs_client.status(path, strict=False)
+    if status is None:
+      raise BeamIOError('File not found: %s' % url)
+    return FileMetadata(url, status[_FILE_STATUS_LENGTH])

Review Comment:
   Does HDFS have a way to get the timestamp?
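
   For reference: if I remember right, the status dict returned by the hdfs client (a WebHDFS FileStatus) carries a modificationTime field in millis since the epoch. Assuming FileMetadata accepts the new optional last_updated_in_seconds argument from the filesystem.py change above, the wiring could look like this sketch (the constant name is hypothetical):

       _FILE_STATUS_UPDATED = 'modificationTime'  # WebHDFS reports millis since epoch

       return FileMetadata(
           url,
           status[_FILE_STATUS_LENGTH],
           status[_FILE_STATUS_UPDATED] / 1000.0)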



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org