You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@gobblin.apache.org by GitBox <gi...@apache.org> on 2023/01/14 01:52:38 UTC

[GitHub] [gobblin] ZihanLi58 opened a new pull request, #3624: [GOBBLIN-1765] Add support to sync metadata for dir in manifest based copy

ZihanLi58 opened a new pull request, #3624:
URL: https://github.com/apache/gobblin/pull/3624

   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
   
   
   ### JIRA
   - [ ] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
       - https://issues.apache.org/jira/browse/GOBBLIN-1765
   
   
   ### Description
   - [ ] Here are some details about my PR, including screenshots (if applicable):
   We want to add support so that when user specify a dir in manifest file, we will blindly sync the metadata for that dir regardless the modification time.
   This change include:
   1. add config to support whether we want preserve mod time for dir and by default it's true
   2. Change the FileAwareInpuSTreamWriter so that during commit, when we see the output path already exist and CopyableFile is one directory, we won't rename/overwrite the target dir incase we overwrite the file that already been copied in the same job.   We still want to add this dir as copyableFile because only in this case, we can make sure publisher will be executed even there is no other file to copy and handle the metadata sync for dirs. 
   3. Change the shouldCopy method in manifest to 
     a. Always copy dir
     b. Always copy file if source has newer version
     c. Copy file is source and dst have same version but target permission is different on dst. (this is because chown won't change the modification time)
     d. never copy file if dst has newer version
   
   ### Tests
   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   Existing unit test all pass. Also test the real manifest copy job and verify that we can sync metadata for dir without affecting the files already existed in the dir
   
   ### Commits
   - [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
       1. Subject is separated from body by a blank line
       4. Subject is limited to 50 characters
       5. Subject does not end with a period
       6. Subject uses the imperative mood ("add", not "adding")
       7. Body wraps at 72 characters
       8. Body explains "what" and "why", not "how"
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] phet commented on a diff in pull request #3624: [GOBBLIN-1765] Add support to sync metadata for dir in manifest based copy

Posted by GitBox <gi...@apache.org>.
phet commented on code in PR #3624:
URL: https://github.com/apache/gobblin/pull/3624#discussion_r1080640043


##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/ManifestBasedDataset.java:
##########
@@ -125,9 +123,31 @@ public Iterator<FileSet<CopyEntity>> getFileSetIterator(FileSystem targetFs, Cop
     return Collections.singleton(new FileSet.Builder<>(datasetURN(), this).add(copyEntities).build()).iterator();
   }
 
-  private static boolean shouldCopy(FileStatus fileInSource, FileStatus fileInTarget) {
-    //todo: make the rule configurable
-    return fileInSource.getModificationTime() > fileInTarget
-        .getModificationTime();
+  private static boolean shouldCopy(FileStatus fileInSource, FileStatus fileInTarget, CopyConfiguration copyConfiguration) {
+    if (fileInSource.isDirectory()) {
+      // If dir is specified in the manifest, we blindly syn them on the metadata without concerning the mod time
+      return true;
+    }
+    if (fileInSource.getModificationTime() > fileInTarget.getModificationTime()) {
+      // We always copy file if source has a newer version
+      return true;
+    }
+    if (fileInSource.getModificationTime() == fileInTarget.getModificationTime()) {
+      // if source and dst has same version, we compare the permission to determine whether it needs another sync
+      OwnerAndPermission replicatedPermission = CopyableFile.resolveReplicatedOwnerAndPermission(fileInSource, copyConfiguration);
+      boolean needCopy = false;
+      if (replicatedPermission.getGroup() != null ) {
+        needCopy = !fileInTarget.getGroup().equals(replicatedPermission.getGroup());

Review Comment:
   consider encapsulating within a method on `OwnerAndPermission`:
   ```
   public boolean requiresGroupChange(FileStatus targetFStatus) { }
   ```
   all three could roll up into:
   ```
   public boolean requiresChange(FileStatus targetFStatus) {
     return requiresGroupChange(targetFStatus) || requiresOwnerChange(...) || requiresPermissionChange(...);
   }
   ```



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/ManifestBasedDataset.java:
##########
@@ -125,9 +123,31 @@ public Iterator<FileSet<CopyEntity>> getFileSetIterator(FileSystem targetFs, Cop
     return Collections.singleton(new FileSet.Builder<>(datasetURN(), this).add(copyEntities).build()).iterator();
   }
 
-  private static boolean shouldCopy(FileStatus fileInSource, FileStatus fileInTarget) {
-    //todo: make the rule configurable
-    return fileInSource.getModificationTime() > fileInTarget
-        .getModificationTime();
+  private static boolean shouldCopy(FileStatus fileInSource, FileStatus fileInTarget, CopyConfiguration copyConfiguration) {
+    if (fileInSource.isDirectory()) {
+      // If dir is specified in the manifest, we blindly syn them on the metadata without concerning the mod time

Review Comment:
   "syn" => "sync"



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/CopyableFile.java:
##########
@@ -316,23 +316,33 @@ private List<OwnerAndPermission> replicateAncestorsOwnerAndPermission(FileSystem
    */
   public static OwnerAndPermission resolveReplicatedOwnerAndPermission(FileSystem fs, Path path,
       CopyConfiguration copyConfiguration) throws IOException {
-
-    PreserveAttributes preserve = copyConfiguration.getPreserve();
     Optional<FileStatus> originFileStatus = copyConfiguration.getCopyContext().getFileStatus(fs, path);
 
     if (!originFileStatus.isPresent()) {
       throw new IOException(String.format("Origin path %s does not exist.", path));
     }
 
+    return resolveReplicatedOwnerAndPermission(originFileStatus.get(), copyConfiguration);
+  }
+
+  /**
+   * Computes the correct {@link OwnerAndPermission} obtained from replicating source owner and permissions and applying
+   * the {@link PreserveAttributes} rules in copyConfiguration.
+   * @throws IOException
+   */
+  public static OwnerAndPermission resolveReplicatedOwnerAndPermission(FileStatus originFileStatus,
+      CopyConfiguration copyConfiguration) {
+
+    PreserveAttributes preserve = copyConfiguration.getPreserve();

Review Comment:
   since you always need this within, I'd have the caller pass it.  to have `shouldCopy` (below) specificifically require `PreserveAttributes`, rather than the more broad `CopyConfiguration`, would make for clearer code.



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/publisher/CopyDataPublisher.java:
##########
@@ -185,27 +191,35 @@ private static Multimap<CopyEntity.DatasetAndPartition, WorkUnitState> groupByFi
    * and versionStrategy (usually relevant to modTime as well), since they are subject to change with Publish(rename)
    */
   private void preserveFileAttrInPublisher(CopyableFile copyableFile) throws IOException {
-    // Preserving File ModTime, and set the access time to an initializing value when ModTime is declared to be preserved.
-    if (copyableFile.getPreserve().preserve(PreserveAttributes.Option.MOD_TIME)) {
-      fs.setTimes(copyableFile.getDestination(), copyableFile.getOriginTimestamp(), -1);
+    if (!copyableFile.getFileStatus().isFile() && state.getProp(DatasetUtils.DATASET_PROFILE_CLASS_KEY, "").equals(

Review Comment:
   in `shouldCopy`, you checked `FileStatus::isDirectory()`, why `!FileStatus::isFile()` here?  I believe they're interchangeable on HDFS, but in the general case I'm uncertain they are strictly equivalent



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/publisher/CopyDataPublisher.java:
##########
@@ -185,27 +191,35 @@ private static Multimap<CopyEntity.DatasetAndPartition, WorkUnitState> groupByFi
    * and versionStrategy (usually relevant to modTime as well), since they are subject to change with Publish(rename)
    */
   private void preserveFileAttrInPublisher(CopyableFile copyableFile) throws IOException {
-    // Preserving File ModTime, and set the access time to an initializing value when ModTime is declared to be preserved.
-    if (copyableFile.getPreserve().preserve(PreserveAttributes.Option.MOD_TIME)) {
-      fs.setTimes(copyableFile.getDestination(), copyableFile.getOriginTimestamp(), -1);
+    if (!copyableFile.getFileStatus().isFile() && state.getProp(DatasetUtils.DATASET_PROFILE_CLASS_KEY, "").equals(
+        ManifestBasedDatasetFinder.class.getCanonicalName())){
+      FileStatus dstFile = this.fs.getFileStatus(copyableFile.getDestination());
+      // User specifically try to copy dir metadata (which should be only doable in manifest copy), so we change the group and permissions on destination even the dir already existed
+      FileAwareInputStreamDataWriter.safeSetPathPermission(this.fs, copyableFile.getDestination(),
+          FileAwareInputStreamDataWriter.addExecutePermissionsIfRequired(dstFile, copyableFile.getDestinationOwnerAndPermission()));

Review Comment:
   since it's not called anywhere else outside the class (given it was formerly `private`), let's perform the `addExecutePermissionsIfRequired` call within.  since it seems we want it to happen all the time, let's not bring the potential for the caller to forget



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/publisher/CopyDataPublisher.java:
##########
@@ -185,27 +191,35 @@ private static Multimap<CopyEntity.DatasetAndPartition, WorkUnitState> groupByFi
    * and versionStrategy (usually relevant to modTime as well), since they are subject to change with Publish(rename)
    */
   private void preserveFileAttrInPublisher(CopyableFile copyableFile) throws IOException {
-    // Preserving File ModTime, and set the access time to an initializing value when ModTime is declared to be preserved.
-    if (copyableFile.getPreserve().preserve(PreserveAttributes.Option.MOD_TIME)) {
-      fs.setTimes(copyableFile.getDestination(), copyableFile.getOriginTimestamp(), -1);
+    if (!copyableFile.getFileStatus().isFile() && state.getProp(DatasetUtils.DATASET_PROFILE_CLASS_KEY, "").equals(
+        ManifestBasedDatasetFinder.class.getCanonicalName())){

Review Comment:
   I see the reason why, but it's really an anti-pattern to determine behavior based on hard-coded class names.  seems even less desirable to do so in an unrelated class (as this class doesn't otherwise know about datasets or dataset finders).
   
   might there be another way, such as inserting a config flag, with a clear k-v pair (other than the name of this distant impl. class)?  e.g. `alwaysOverwriteDirMetadata`?



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/publisher/CopyDataPublisher.java:
##########
@@ -185,27 +191,35 @@ private static Multimap<CopyEntity.DatasetAndPartition, WorkUnitState> groupByFi
    * and versionStrategy (usually relevant to modTime as well), since they are subject to change with Publish(rename)
    */
   private void preserveFileAttrInPublisher(CopyableFile copyableFile) throws IOException {
-    // Preserving File ModTime, and set the access time to an initializing value when ModTime is declared to be preserved.
-    if (copyableFile.getPreserve().preserve(PreserveAttributes.Option.MOD_TIME)) {
-      fs.setTimes(copyableFile.getDestination(), copyableFile.getOriginTimestamp(), -1);
+    if (!copyableFile.getFileStatus().isFile() && state.getProp(DatasetUtils.DATASET_PROFILE_CLASS_KEY, "").equals(
+        ManifestBasedDatasetFinder.class.getCanonicalName())){
+      FileStatus dstFile = this.fs.getFileStatus(copyableFile.getDestination());
+      // User specifically try to copy dir metadata (which should be only doable in manifest copy), so we change the group and permissions on destination even the dir already existed

Review Comment:
   "even the dir already existed" => "even when the dir already exists"
   ?



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/writer/FileAwareInputStreamDataWriter.java:
##########
@@ -387,15 +387,15 @@ private void setRecursivePermission(Path path, OwnerAndPermission ownerAndPermis
     Collections.reverse(files);
 
     for (FileStatus file : files) {
-      safeSetPathPermission(file.getPath(), addExecutePermissionsIfRequired(file, ownerAndPermission));
+      safeSetPathPermission(this.fs, file.getPath(), addExecutePermissionsIfRequired(file, ownerAndPermission));
     }
   }
 
   /**
    * The method makes sure it always grants execute permissions for an owner if the <code>file</code> passed is a
    * directory. The publisher needs it to publish it to the final directory and list files under this directory.
    */
-  private static OwnerAndPermission addExecutePermissionsIfRequired(FileStatus file,
+  public static OwnerAndPermission addExecutePermissionsIfRequired(FileStatus file,

Review Comment:
   while you're at it--and before it becomes `public`--I would rename to `setOwnerExecuteBitIfDirectory`... 
   (should be safe, as this was previously `private`)



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/writer/FileAwareInputStreamDataWriter.java:
##########
@@ -456,10 +456,14 @@ public void commit()
               : copyableFile.getAncestorsOwnerAndPermission().iterator();
 
       ensureDirectoryExists(this.fs, outputFilePath.getParent(), ancestorOwnerAndPermissionIt);
-
-      // Do not store the FileContext after doing the rename because FileContexts are not cached and a new object
-      // is created for every task's commit
-      FileContext.getFileContext(this.uri, this.conf).rename(stagingFilePath, outputFilePath, renameOptions);
+      if (copyableFile.getFileStatus().isDirectory() && this.fs.exists(outputFilePath)) {
+        log.info(String.format("CopyableFile %s is a directory which is already exist on outputPath %s,"
+            + " will not overwrite it in writer, publisher should take care of the metadata sync if needed", stagingFilePath, outputFilePath));

Review Comment:
   suggestion:
   ```
   "which is already exist on outputPath %s, will not overwrite it in writer, publisher should take care of the metadata sync if needed",
   ```
     =>
   ```
   "which already exists at %s - skipping overwrite; if necessary, publisher will sync metadata",
   ```



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/publisher/CopyDataPublisher.java:
##########
@@ -185,27 +191,35 @@ private static Multimap<CopyEntity.DatasetAndPartition, WorkUnitState> groupByFi
    * and versionStrategy (usually relevant to modTime as well), since they are subject to change with Publish(rename)
    */
   private void preserveFileAttrInPublisher(CopyableFile copyableFile) throws IOException {
-    // Preserving File ModTime, and set the access time to an initializing value when ModTime is declared to be preserved.
-    if (copyableFile.getPreserve().preserve(PreserveAttributes.Option.MOD_TIME)) {
-      fs.setTimes(copyableFile.getDestination(), copyableFile.getOriginTimestamp(), -1);
+    if (!copyableFile.getFileStatus().isFile() && state.getProp(DatasetUtils.DATASET_PROFILE_CLASS_KEY, "").equals(
+        ManifestBasedDatasetFinder.class.getCanonicalName())){
+      FileStatus dstFile = this.fs.getFileStatus(copyableFile.getDestination());
+      // User specifically try to copy dir metadata (which should be only doable in manifest copy), so we change the group and permissions on destination even the dir already existed
+      FileAwareInputStreamDataWriter.safeSetPathPermission(this.fs, copyableFile.getDestination(),
+          FileAwareInputStreamDataWriter.addExecutePermissionsIfRequired(dstFile, copyableFile.getDestinationOwnerAndPermission()));
     }
+    if ((!copyableFile.getFileStatus().isFile() && preserveDirModTime) || copyableFile.getFileStatus().isFile()) {

Review Comment:
   seems clearer:
   ```
   if (copyableFile.getFileStatus.isFile() || preserveDirModTime) {
   ```
   also, above, when setting `preserveDirModTime`, good to note the default of true preserves prior semantics



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] codecov-commenter commented on pull request #3624: [GOBBLIN-1765] Add support to sync metadata for dir in manifest based copy

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #3624:
URL: https://github.com/apache/gobblin/pull/3624#issuecomment-1382634074

   # [Codecov](https://codecov.io/gh/apache/gobblin/pull/3624?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#3624](https://codecov.io/gh/apache/gobblin/pull/3624?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (4c764ee) into [master](https://codecov.io/gh/apache/gobblin/commit/85cd1f9875ba64bd059b70a4d5fc40a38383d326?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (85cd1f9) will **decrease** coverage by `2.65%`.
   > The diff coverage is `n/a`.
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #3624      +/-   ##
   ============================================
   - Coverage     46.44%   43.78%   -2.66%     
   + Complexity    10621     2063    -8558     
   ============================================
     Files          2130      409    -1721     
     Lines         83303    17626   -65677     
     Branches       9282     2152    -7130     
   ============================================
   - Hits          38692     7718   -30974     
   + Misses        41039     9049   -31990     
   + Partials       3572      859    -2713     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/gobblin/pull/3624?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...obblin/data/management/copy/CopyConfiguration.java](https://codecov.io/gh/apache/gobblin/pull/3624?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1kYXRhLW1hbmFnZW1lbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2dvYmJsaW4vZGF0YS9tYW5hZ2VtZW50L2NvcHkvQ29weUNvbmZpZ3VyYXRpb24uamF2YQ==) | | |
   | [...che/gobblin/data/management/copy/CopyableFile.java](https://codecov.io/gh/apache/gobblin/pull/3624?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1kYXRhLW1hbmFnZW1lbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2dvYmJsaW4vZGF0YS9tYW5hZ2VtZW50L2NvcHkvQ29weWFibGVGaWxlLmphdmE=) | | |
   | [...lin/data/management/copy/ManifestBasedDataset.java](https://codecov.io/gh/apache/gobblin/pull/3624?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1kYXRhLW1hbmFnZW1lbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2dvYmJsaW4vZGF0YS9tYW5hZ2VtZW50L2NvcHkvTWFuaWZlc3RCYXNlZERhdGFzZXQuamF2YQ==) | | |
   | [...a/management/copy/publisher/CopyDataPublisher.java](https://codecov.io/gh/apache/gobblin/pull/3624?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1kYXRhLW1hbmFnZW1lbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2dvYmJsaW4vZGF0YS9tYW5hZ2VtZW50L2NvcHkvcHVibGlzaGVyL0NvcHlEYXRhUHVibGlzaGVyLmphdmE=) | | |
   | [...nt/copy/writer/FileAwareInputStreamDataWriter.java](https://codecov.io/gh/apache/gobblin/pull/3624?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1kYXRhLW1hbmFnZW1lbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2dvYmJsaW4vZGF0YS9tYW5hZ2VtZW50L2NvcHkvd3JpdGVyL0ZpbGVBd2FyZUlucHV0U3RyZWFtRGF0YVdyaXRlci5qYXZh) | | |
   | [...apache/gobblin/runtime/cli/PasswordManagerCLI.java](https://codecov.io/gh/apache/gobblin/pull/3624?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1ydW50aW1lL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3J1bnRpbWUvY2xpL1Bhc3N3b3JkTWFuYWdlckNMSS5qYXZh) | | |
   | [...va/org/apache/gobblin/service/FlowClientUtils.java](https://codecov.io/gh/apache/gobblin/pull/3624?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1yZXN0bGkvZ29iYmxpbi1mbG93LWNvbmZpZy1zZXJ2aWNlL2dvYmJsaW4tZmxvdy1jb25maWctc2VydmljZS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2dvYmJsaW4vc2VydmljZS9GbG93Q2xpZW50VXRpbHMuamF2YQ==) | | |
   | [...gement/copy/TimeAwareRecursiveCopyableDataset.java](https://codecov.io/gh/apache/gobblin/pull/3624?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1kYXRhLW1hbmFnZW1lbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2dvYmJsaW4vZGF0YS9tYW5hZ2VtZW50L2NvcHkvVGltZUF3YXJlUmVjdXJzaXZlQ29weWFibGVEYXRhc2V0LmphdmE=) | | |
   | [...service/modules/dataset/HttpDatasetDescriptor.java](https://codecov.io/gh/apache/gobblin/pull/3624?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3NlcnZpY2UvbW9kdWxlcy9kYXRhc2V0L0h0dHBEYXRhc2V0RGVzY3JpcHRvci5qYXZh) | | |
   | [...n/runtime/std/DefaultJobLifecycleListenerImpl.java](https://codecov.io/gh/apache/gobblin/pull/3624?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-Z29iYmxpbi1ydW50aW1lL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3J1bnRpbWUvc3RkL0RlZmF1bHRKb2JMaWZlY3ljbGVMaXN0ZW5lckltcGwuamF2YQ==) | | |
   | ... and [1712 more](https://codecov.io/gh/apache/gobblin/pull/3624?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] phet commented on a diff in pull request #3624: [GOBBLIN-1765] Add support to sync metadata for dir in manifest based copy

Posted by GitBox <gi...@apache.org>.
phet commented on code in PR #3624:
URL: https://github.com/apache/gobblin/pull/3624#discussion_r1080643334


##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/CopyableFile.java:
##########
@@ -316,23 +316,33 @@ private List<OwnerAndPermission> replicateAncestorsOwnerAndPermission(FileSystem
    */
   public static OwnerAndPermission resolveReplicatedOwnerAndPermission(FileSystem fs, Path path,
       CopyConfiguration copyConfiguration) throws IOException {
-
-    PreserveAttributes preserve = copyConfiguration.getPreserve();
     Optional<FileStatus> originFileStatus = copyConfiguration.getCopyContext().getFileStatus(fs, path);
 
     if (!originFileStatus.isPresent()) {
       throw new IOException(String.format("Origin path %s does not exist.", path));
     }
 
+    return resolveReplicatedOwnerAndPermission(originFileStatus.get(), copyConfiguration);
+  }
+
+  /**
+   * Computes the correct {@link OwnerAndPermission} obtained from replicating source owner and permissions and applying
+   * the {@link PreserveAttributes} rules in copyConfiguration.
+   * @throws IOException
+   */
+  public static OwnerAndPermission resolveReplicatedOwnerAndPermission(FileStatus originFileStatus,
+      CopyConfiguration copyConfiguration) {
+
+    PreserveAttributes preserve = copyConfiguration.getPreserve();

Review Comment:
   since you always need this within, I'd have the caller pass it.  to have `shouldCopy` (below) specifically require `PreserveAttributes`, rather than the more broad `CopyConfiguration`, would make for clearer code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] ZihanLi58 commented on a diff in pull request #3624: [GOBBLIN-1765] Add support to sync metadata for dir in manifest based copy

Posted by GitBox <gi...@apache.org>.
ZihanLi58 commented on code in PR #3624:
URL: https://github.com/apache/gobblin/pull/3624#discussion_r1081946270


##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/CopyableFile.java:
##########
@@ -316,23 +316,33 @@ private List<OwnerAndPermission> replicateAncestorsOwnerAndPermission(FileSystem
    */
   public static OwnerAndPermission resolveReplicatedOwnerAndPermission(FileSystem fs, Path path,
       CopyConfiguration copyConfiguration) throws IOException {
-
-    PreserveAttributes preserve = copyConfiguration.getPreserve();
     Optional<FileStatus> originFileStatus = copyConfiguration.getCopyContext().getFileStatus(fs, path);
 
     if (!originFileStatus.isPresent()) {
       throw new IOException(String.format("Origin path %s does not exist.", path));
     }
 
+    return resolveReplicatedOwnerAndPermission(originFileStatus.get(), copyConfiguration);
+  }
+
+  /**
+   * Computes the correct {@link OwnerAndPermission} obtained from replicating source owner and permissions and applying
+   * the {@link PreserveAttributes} rules in copyConfiguration.
+   * @throws IOException
+   */
+  public static OwnerAndPermission resolveReplicatedOwnerAndPermission(FileStatus originFileStatus,
+      CopyConfiguration copyConfiguration) {
+
+    PreserveAttributes preserve = copyConfiguration.getPreserve();

Review Comment:
   I want to use the same existing function to calculate the final permission which is  CopyableFile.resolveReplicatedOwnerAndPermission() so that we won't ends up with different behavior in the future when we modify the logic of permission preserving; That function require the whole copyConfiguration because other than the preserve config, we also support target group config to set the user defined group, and I assume that we might want to support something like this for facl as well in the future. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] ZihanLi58 merged pull request #3624: [GOBBLIN-1765] Add support to sync metadata for dir in manifest based copy

Posted by "ZihanLi58 (via GitHub)" <gi...@apache.org>.
ZihanLi58 merged PR #3624:
URL: https://github.com/apache/gobblin/pull/3624


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [gobblin] phet commented on a diff in pull request #3624: [GOBBLIN-1765] Add support to sync metadata for dir in manifest based copy

Posted by "phet (via GitHub)" <gi...@apache.org>.
phet commented on code in PR #3624:
URL: https://github.com/apache/gobblin/pull/3624#discussion_r1082979961


##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/OwnerAndPermission.java:
##########
@@ -66,4 +67,26 @@ public static OwnerAndPermission read(DataInput input) throws IOException {
     oap.readFields(input);
     return oap;
   }
+
+  /**
+   * given a file, return whether the metadata for the file match the current owner and permission
+   * note: if field is null, we always think it's match as no update needed.
+   * @param file the file status that need to be evaluated
+   * @return true if the metadata for the file match the current owner and permission
+   */
+  public boolean isHavingSameOwnerAndPermission(FileStatus file) {

Review Comment:
   I like this method!
   
   minor note: "has" is a much more idiomatic way to say "isHaving"



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/writer/FileAwareInputStreamDataWriter.java:
##########
@@ -387,15 +389,15 @@ private void setRecursivePermission(Path path, OwnerAndPermission ownerAndPermis
     Collections.reverse(files);
 
     for (FileStatus file : files) {
-      safeSetPathPermission(this.fs, file.getPath(), addExecutePermissionsIfRequired(file, ownerAndPermission));
+      safeSetPathPermission(this.fs, file, ownerAndPermission);
     }
   }
 
   /**
    * The method makes sure it always grants execute permissions for an owner if the <code>file</code> passed is a
    * directory. The publisher needs it to publish it to the final directory and list files under this directory.
    */
-  public static OwnerAndPermission addExecutePermissionsIfRequired(FileStatus file,
+  public static OwnerAndPermission setOwnerExecuteBitIfDirectory(FileStatus file,

Review Comment:
   now that you call from within, can it go back to `private`?



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/CopyConfiguration.java:
##########
@@ -51,6 +51,8 @@ public class CopyConfiguration {
   public static final String COPY_PREFIX = "gobblin.copy";
   public static final String PRESERVE_ATTRIBUTES_KEY = COPY_PREFIX + ".preserved.attributes";
   public static final String PRESERVE_MODTIME_FOR_DIR = COPY_PREFIX + ".preserve.mod.time.for.dir";
+  //Setting this to be false means we will not overwrite the owner and permission if dir already exist on dst
+  public static final String ALWAYS_OVERWRITE_DIR_OWNER_AND_PERMISSION = COPY_PREFIX + ".alwaysOverwriteDirOwnerAndPermission";

Review Comment:
   two thoughts here:
   1. isn't this manifest-file-specific?  for other finders, won't dirs already on dest be skipped, whereas only manifest-based copy can ensure 'always'?  if so, let's qualify the name accordingly.
   2. "always overwrite" may not be quite accurate, since that would imply we always do work, even when nothing to change (the fact might be reflected in an updated mod time).  perhaps more accurately something like `resyncOwnerAndPermissionEvenIfDirExists`.



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/ManifestBasedDataset.java:
##########
@@ -124,29 +131,15 @@ public Iterator<FileSet<CopyEntity>> getFileSetIterator(FileSystem targetFs, Cop
   }
 
   private static boolean shouldCopy(FileStatus fileInSource, FileStatus fileInTarget, CopyConfiguration copyConfiguration) {
-    if (fileInSource.isDirectory()) {
-      // If dir is specified in the manifest, we blindly syn them on the metadata without concerning the mod time
-      return true;
+    if (fileInSource.isDirectory() || fileInSource.getModificationTime() == fileInTarget.getModificationTime()) {
+      // if source is dir or source and dst has same version, we compare the permission to determine whether it needs another sync
+      OwnerAndPermission replicatedPermission = CopyableFile.resolveReplicatedOwnerAndPermission(fileInSource, copyConfiguration);
+      return !replicatedPermission.isHavingSameOwnerAndPermission(fileInTarget);
     }
     if (fileInSource.getModificationTime() > fileInTarget.getModificationTime()) {

Review Comment:
   should be equivalent to replace `if` with `return` and leave off the rest of the method...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org