Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2019/04/22 19:36:35 UTC

[GitHub] [airflow] potiuk commented on a change in pull request #4766: [AIRFLOW-3720] Add prefix to file match in GCS_TO_S3 operator to avoid mismatch

URL: https://github.com/apache/airflow/pull/4766#discussion_r277405643
 
 

 ##########
 File path: airflow/contrib/operators/gcs_to_s3.py
 ##########
 @@ -99,8 +99,12 @@ def execute(self, context):
             # if we are not replacing -> list all files in the S3 bucket
             # and only keep those files which are present in
             # Google Cloud Storage and not in S3
-            bucket_name, _ = S3Hook.parse_s3_url(self.dest_s3_key)
-            existing_files = s3_hook.list_keys(bucket_name)
+            bucket_name, prefix = S3Hook.parse_s3_url(self.dest_s3_key)
+            # list only keys under the destination prefix to avoid
+            # matching keys in parent directories
+            existing_files = s3_hook.list_keys(bucket_name, prefix=prefix)
+            # strip the prefix from the existing keys so they match the GCS file names
+            existing_files = [file.replace(prefix, '') for file in existing_files]
 
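The change above can be sketched in plain Python (a hypothetical helper standing in for the hook calls; `files_to_copy`, its arguments, and the sample values are illustrative, not part of the operator's API):

```python
def files_to_copy(gcs_files, s3_keys, prefix):
    """Return the GCS files not yet present under the S3 destination prefix."""
    # Keep only S3 keys under the destination prefix, then strip that
    # prefix so the names compare directly against the GCS file names.
    existing = [k.replace(prefix, '', 1) for k in s3_keys if k.startswith(prefix)]
    return [f for f in gcs_files if f not in existing]

files_to_copy(["a.csv", "b.csv"], ["dest/a.csv", "other/b.csv"], "dest/")
# -> ["b.csv"]  (only dest/a.csv counts as already copied)
```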
 Review comment:
  Please pass count=1 to the replace method. There is a slim chance that the prefix occurs twice in the key path, e.g. s3://bucket/repeated_key/repeated_key, and without count=1 replace would strip every occurrence instead of only the leading one.
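The pitfall the review points out can be shown with a small example (the key and prefix values are hypothetical):

```python
key = "logs/logs/file.txt"   # key whose path happens to repeat the prefix
prefix = "logs/"

# Default str.replace removes *every* occurrence of the prefix:
key.replace(prefix, "")       # -> "file.txt"  (wrong: inner "logs/" is lost)

# With count=1 only the leading occurrence is stripped:
key.replace(prefix, "", 1)    # -> "logs/file.txt"
```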

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services