You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-issues@hadoop.apache.org by GitBox <gi...@apache.org> on 2022/03/13 16:01:12 UTC

[GitHub] [hadoop] attilapiros commented on pull request #2971: MAPREDUCE-7341. Intermediate Manifest Committer

attilapiros commented on pull request #2971:
URL: https://github.com/apache/hadoop/pull/2971#issuecomment-1066131080


   I had an idea for testing this committer (and some of the others too). 
   Actually I started to put it together but now I am a bit uncertain how valuable this will be so I decided to share it with you in the current state before I waste too much time.
   
   The basic idea is simple: as everything is based on the output directory content we can check the unified diff after each of the operations. One possibility is to represent the whole directory content as string (here the `before` and `after` just arbitrary labels).
   
   My asserts become:
   ```java
       final Path textOutputPath = writeTextOutput(tContext);
       dirState = assertDirStateDiff(dirState, Arrays.asList(
           "--- before",
           "+++ after",
           "@@ -1,0 +1,9 @@",
           "+file:///_temporary/job_ID_0001/01/tasks/attempt_ID_0001_m_000000_0/part-m-00000",
           "+~~~",
           "+key1\tval1",
           "+val1",
           "+val2",
           "+key2",
           "+key1",
           "+key2\tval2",
           "+~~~"));
   ```
   
   As you see I replaced the changing IDs with an _ID_  literal and from the absolute paths I have removed the output directory part.
   
   And here comes what I don't like (the huge json files):
   ```java
     commitTask(committer, tContext);
     dirState = assertDirStateDiff(dirState, Arrays.asList(
       "--- before",
       "+++ after",
       "@@ -9,1 +9,24 @@",
       " ~~~",
       "+file:///_temporary/job_ID_0001/01/manifests/task_ID_0001_m_000000-manifest.json",
       "+~~~",
       "+{",
       "+  \"extraData\" : { },",
       "+  \"type\" : \"org.apache.hadoop.mapreduce.lib.output.committer.manifest.files.TaskManifest/1\",",
       "+  \"version\" : 1,",
       "+  \"jobId\" : \"job_ID_0001\",",
       "+  \"jobAttemptNumber\" : 1,",
       "+  \"taskID\" : \"task_ID_0001_m_000000\",",
       "+  \"taskAttemptID\" : \"attempt_ID_0001_m_000000_0\",",
       "+  \"taskAttemptDir\" : \"file:/_temporary/job_ID_0001/01/tasks/attempt_ID_0001_m_000000_0\",",
       "+  \"files\" : [ {",
       "+    \"s\" : \"file:///_temporary/job_ID_0001/01/tasks/attempt_ID_0001_m_000000_0/part-m-00000\",",
       "+    \"d\" : \"file:/part-m-00000\",",
       "+    \"z\" : 40",
       "+  } ],",
       "+  \"directories\" : [ ],",
       "+  \"iostatistics\" : {",
       "+    \"counters\" : {",
       "+      \"committer_commit_job\" : 0,",
       "+      \"op_msync\" : 0,",
       "+      \"op_msync.failures\" : 0,",
       "+~~~"));
   ``` 
   Here `assertDirStateDiff` always cut the files longer then 20 lines. But we could use a an external resource too for the longer files.
   
   The code is not that complex:
   
   ```java
     private static List<String> unifiedDiff(List<String> before, List<String> after) {
       List<String> res =
         UnifiedDiffUtils.generateUnifiedDiff("before", "after", before, DiffUtils.diff(before, after), 1);
       // LOG.info("{}", "\"" + String.join("\", \"", res) + "\"");
       return res;
     }
   
     private static String replaceChangingParts(String replace, String input) {
       return input.replaceFirst("job_[0-9]*_", "job_ID_")
         .replaceFirst("attempt_[0-9]*_", "attempt_ID_")
         .replaceFirst("task_[0-9]*_", "task_ID_")
         .replace(replace, "");
     }
   
     private void getOutputDirContent(List<String> lines, Path path) throws IOException {
       FileStatus[] statuses = getFileSystem().listStatus(path);
       for (FileStatus status : statuses) {
         if (status.isDirectory()) {
           getOutputDirContent(lines, status.getPath());
         } else {
           lines.add(replaceChangingParts(getOutputDir().toUri().getPath().toString(), status.getPath().toUri().toString()));
           lines.add("~~~");
           try(FSDataInputStream in = getFileSystem().open(status.getPath());
               InputStreamReader is = new InputStreamReader(in);
               BufferedReader br = new BufferedReader(is)) {
             br.lines().limit(20).forEach((l) -> lines.add(replaceChangingParts(getOutputDir().toUri().getPath().toString(), l)));
           }
           lines.add("~~~");
         }
       }
     }
   
     private List<String> assertDirStateDiff(List<String> previousDirState, List<String> diffList) throws IOException {
       List<String> currentDirState = getOutputDirContent();
       Assertions.assertThat(unifiedDiff(previousDirState, currentDirState))
         .isEqualTo(diffList);
       return currentDirState;
     }
   
     private List<String> getOutputDirContent() throws IOException {
       List<String> lines = new ArrayList<>();
       getOutputDirContent(lines, getOutputDir());
       return lines;
     }
     ```
   
   For the unified diff I am using a new dependency (it has Apache-2.0 License):
   ```
       <dependency>
         <groupId>io.github.java-diff-utils</groupId>
         <artifactId>java-diff-utils</artifactId>
         <version>4.9</version>
       </dependency>
   ```
   
   I think one of the advantage is `getOutputDirContent` is always a full scan and if something need to be removed it will be in contained in the unified diff and never be left out.
   Do you see any value in this idea? Is not this too much restrictions on this code? 
    
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org