You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/01/29 00:37:48 UTC

[GitHub] [incubator-hudi] prashantwason opened a new pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

prashantwason opened a new pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289
 
 
   ## What is the purpose of the pull request
   
   HUDI DAG stages do not have any names. The Spark History Server UI shows these stages with the HUDI JAVA filename and the line number. 
   
   This change provides descriptive names for the stages which makes it easier to visualize the HUDI DAG.
   
   
   ## Brief change log
   1. All code locations where JavaSparkContext.XXX (DAG creation methods like parallelize) are used have been updated with descriptive names for the job with the following API
      jsc.setJobGroup(title, description);
   
   2. Title is the class name so its easy to identify. Description is the activity for that stage.
   
   3. Unit test code has been updates to enable the unit tests to write the Spark event logs so that even the unit tests can be visualized in a locally installed  Spark History Server UI.
   
   4. The unit tests need to be run as follows:
   
   mvn test -DSPARK_EVLOG_DIR=/path/for/spark/event/log
   
   Once the unit tests complete, the Spark History Server shows the application logs for all completed unit tests. The unit tests themselves can be identified by the test-class-name as the application name.
   
   ## Verify this pull request
   
   -- This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   1. All existing unit tests pass.
   2. If the SPARK_EVLOG_DIR property is not set, spark event logging is not enabled (this is defauly behavior)
   3. Manually verified the change by running unit tests locally and observing the application logs on the locally installed Spark History Server.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#discussion_r375665830
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/CompactionAdminClient.java
 ##########
 @@ -356,6 +357,7 @@ private ValidationOpResult validateCompactionOperation(HoodieTableMetaClient met
     } else {
       LOG.info("The following compaction renaming operations needs to be performed to un-schedule");
       if (!dryRun) {
+        jsc.setJobGroup(this.getClass().getSimpleName(), "Execute renaming operations");
 
 Review comment:
   Change to `Execute unschedule operations` 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-599720266
 
 
   @prashantwason still driving this? Can I help get this moving along? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] prashantwason commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-606167919
 
 
   @vinothchandar Yep, I would like this to move forward. Let me revive this as it seems there are merge conflicts now.
   
   We dont have this deployed yet so the only DAG screenshots I can provide are from unit tests. On you end, can you provide me some specific unit tests which are exercising the specific DAGs you are interested in? I can generate the screenshots from those.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-590512271
 
 
   @prashantwason Wondering if you have some screenshots for the upsert dag.. (I can try running tjhe PR locally if not) 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-587712891
 
 
   >So we need to label each stage and they should show up correctly.
   
   if we could do that, and if you can post the `upsert()` dag for example, that would be great.. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [hudi] vinothchandar commented on pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #1289:
URL: https://github.com/apache/hudi/pull/1289#issuecomment-644114816


   @n3nash this has been since reassigned.. feel free to grab this review, I have few others queued up before I can get to this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] prashantwason commented on pull request #1289: [WIP] [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on pull request #1289:
URL: https://github.com/apache/hudi/pull/1289#issuecomment-636255758


   List of executed tests - the name is made up of the test and the method
   ![Screenshot (1)](https://user-images.githubusercontent.com/58448203/83316516-13c7ea80-a1db-11ea-8cb2-e998662ec5b0.png)
   
   Various DAGs
   ![Screenshot (2)](https://user-images.githubusercontent.com/58448203/83316518-17f40800-a1db-11ea-9adc-5ce32f16308d.png)
   ![Screenshot (3)](https://user-images.githubusercontent.com/58448203/83316519-1aeef880-a1db-11ea-8407-a35eb470ab07.png)
   ![Screenshot (4)](https://user-images.githubusercontent.com/58448203/83316520-1cb8bc00-a1db-11ea-9bef-ce1a98766dda.png)
   ![Screenshot (5)](https://user-images.githubusercontent.com/58448203/83316522-1f1b1600-a1db-11ea-8423-ec80ae4984e7.png)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] n3nash commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-579907841
 
 
   @prashantwason I think you tried another approach using aspects to make the code look cleaner right ? Could you please briefly describe that approach here (pros and cons) here so reviewers are also aware of it ?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#discussion_r375667569
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java
 ##########
 @@ -254,6 +255,7 @@ private int determineParallelism(int inputParallelism, int totalSubPartitions) {
 
     if (config.getBloomIndexPruneByRanges()) {
       // also obtain file ranges, if range pruning is enabled
+      jsc.setJobDescription("Obtain file ranges as range pruning is enabled");
 
 Review comment:
   `Obtain key ranges for file slices (range pruning=on)`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [hudi] codecov-commenter edited a comment on pull request #1289: [WIP] [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1289:
URL: https://github.com/apache/hudi/pull/1289#issuecomment-636262398


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/1289?src=pr&el=h1) Report
   > Merging [#1289](https://codecov.io/gh/apache/hudi/pull/1289?src=pr&el=desc) into [master](https://codecov.io/gh/apache/hudi/commit/5a0d3f1cf963e0061364d915ac86a465dd079bac&el=desc) will **increase** coverage by `0.00%`.
   > The diff coverage is `20.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/1289/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/1289?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff            @@
   ##             master    #1289   +/-   ##
   =========================================
     Coverage     18.19%   18.19%           
     Complexity      856      856           
   =========================================
     Files           348      348           
     Lines         15344    15359   +15     
     Branches       1523     1523           
   =========================================
   + Hits           2792     2795    +3     
   - Misses        12195    12207   +12     
     Partials        357      357           
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/1289?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../org/apache/hudi/client/CompactionAdminClient.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0NvbXBhY3Rpb25BZG1pbkNsaWVudC5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | [...c/main/java/org/apache/hudi/table/HoodieTable.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllVGFibGUuamF2YQ==) | `39.53% <0.00%> (-0.63%)` | `22.00 <0.00> (ø)` | |
   | [...e/hudi/table/action/clean/CleanActionExecutor.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NsZWFuL0NsZWFuQWN0aW9uRXhlY3V0b3IuamF2YQ==) | `10.52% <0.00%> (-0.17%)` | `5.00 <0.00> (ø)` | |
   | [...ction/compact/HoodieMergeOnReadTableCompactor.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbXBhY3QvSG9vZGllTWVyZ2VPblJlYWRUYWJsZUNvbXBhY3Rvci5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | [...on/rollback/MergeOnReadRollbackActionExecutor.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL3JvbGxiYWNrL01lcmdlT25SZWFkUm9sbGJhY2tBY3Rpb25FeGVjdXRvci5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | [...che/hudi/table/action/rollback/RollbackHelper.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL3JvbGxiYWNrL1JvbGxiYWNrSGVscGVyLmphdmE=) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | [...able/action/savepoint/SavepointActionExecutor.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL3NhdmVwb2ludC9TYXZlcG9pbnRBY3Rpb25FeGVjdXRvci5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | [...n/java/org/apache/hudi/index/HoodieIndexUtils.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvSG9vZGllSW5kZXhVdGlscy5qYXZh) | `47.61% <100.00%> (+2.61%)` | `3.00 <1.00> (ø)` | |
   | [.../org/apache/hudi/index/bloom/HoodieBloomIndex.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vSG9vZGllQmxvb21JbmRleC5qYXZh) | `59.00% <100.00%> (+0.41%)` | `16.00 <0.00> (ø)` | |
   | [...he/hudi/table/action/commit/UpsertPartitioner.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9VcHNlcnRQYXJ0aXRpb25lci5qYXZh) | `55.71% <100.00%> (+0.31%)` | `15.00 <0.00> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/hudi/pull/1289?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/hudi/pull/1289?src=pr&el=footer). Last update [5a0d3f1...ed384e6](https://codecov.io/gh/apache/hudi/pull/1289?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] prashantwason commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-582587735
 
 
   Spark History Server Screenshots
   
   ![List of tests](https://user-images.githubusercontent.com/58448203/73878606-b1a03f80-480f-11ea-89c0-13da2feb244c.png)
   
   
   ![Event timeline](https://user-images.githubusercontent.com/58448203/73878671-d1cffe80-480f-11ea-8ea7-770f80c4213a.png)
   
   
   ![TestHoodieClientCopyOnWrite](https://user-images.githubusercontent.com/58448203/73878634-c086f200-480f-11ea-8b4f-96c810ee6fbf.png)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [hudi] nsivabalan commented on pull request #1289: [WIP] [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #1289:
URL: https://github.com/apache/hudi/pull/1289#issuecomment-633148382


   @prashantwason : Once you update the PR, do let @lamber-ken know that its ready for review. 
   @lamber-ken : would you mind reviewing this PR. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-582758667
 
 
   @prashantwason this is so awesome! Started reviewing this .. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-582759343
 
 
   >The unit tests need to be run as follows:
   > mvn test -DSPARK_EVLOG_DIR=/path/for/spark/event/log
   lets add this to the README, under a new section `Running Tests` 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-579907245
 
 
   @bvaradar @vinothchandar can you guys help review this ?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-581787335
 
 
   @prashantwason  this is a great contribution for anyone debugging hudi writing... Can you post some screenshots for how upsert/bulk_insert dags now show up on the UI? 
   
   also @n3nash if you want to review this, feel free to grab this from me 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [hudi] n3nash commented on pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
n3nash commented on pull request #1289:
URL: https://github.com/apache/hudi/pull/1289#issuecomment-655266313


   @prashantwason can you rebase and push please ? I can then merge this


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
n3nash commented on pull request #1289:
URL: https://github.com/apache/hudi/pull/1289#issuecomment-643380927


   @vinothchandar could you take a look at the screenshots and see if that provides what you were looking for ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] vinothchandar edited a comment on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar edited a comment on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-582759343
 
 
   >The unit tests need to be run as follows:
   > mvn test -DSPARK_EVLOG_DIR=/path/for/spark/event/log
   
   lets add this to the README, under a new section `Running Tests` 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#discussion_r375665978
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/CompactionAdminClient.java
 ##########
 @@ -398,6 +400,7 @@ private ValidationOpResult validateCompactionOperation(HoodieTableMetaClient met
           "Number of Compaction Operations :" + plan.getOperations().size() + " for instant :" + compactionInstant);
       List<CompactionOperation> ops = plan.getOperations().stream()
           .map(CompactionOperation::convertFromAvroRecordInstance).collect(Collectors.toList());
+      jsc.setJobGroup(this.getClass().getSimpleName(), "Generate renaming operations");
 
 Review comment:
   change to `Generate compaction unscheduling operations` ? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#discussion_r375668006
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
 ##########
 @@ -298,6 +298,7 @@ public HoodieCleanerPlan scheduleClean(JavaSparkContext jsc) {
       int cleanerParallelism = Math.min(partitionsToClean.size(), config.getCleanerParallelism());
       LOG.info("Using cleanerParallelism: " + cleanerParallelism);
 
+      jsc.setJobGroup(this.getClass().getSimpleName(), "Generates List of files to be cleaned");
 
 Review comment:
   I know the comments say `files` and you are just using that. but would be nice to stick to our terminologies as much as possible. https://cwiki.apache.org/confluence/display/HUDI/Design+And+Architecture 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [hudi] prashantwason commented on pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on pull request #1289:
URL: https://github.com/apache/hudi/pull/1289#issuecomment-657742470


   @n3nash  I have rebased the changes. Build is green.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-commenter commented on pull request #1289: [WIP] [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1289:
URL: https://github.com/apache/hudi/pull/1289#issuecomment-636262398


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/1289?src=pr&el=h1) Report
   > Merging [#1289](https://codecov.io/gh/apache/hudi/pull/1289?src=pr&el=desc) into [master](https://codecov.io/gh/apache/hudi/commit/5a0d3f1cf963e0061364d915ac86a465dd079bac&el=desc) will **increase** coverage by `0.00%`.
   > The diff coverage is `20.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/1289/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/1289?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff            @@
   ##             master    #1289   +/-   ##
   =========================================
     Coverage     18.19%   18.19%           
     Complexity      856      856           
   =========================================
     Files           348      348           
     Lines         15344    15359   +15     
     Branches       1523     1523           
   =========================================
   + Hits           2792     2795    +3     
   - Misses        12195    12207   +12     
     Partials        357      357           
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/1289?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../org/apache/hudi/client/CompactionAdminClient.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0NvbXBhY3Rpb25BZG1pbkNsaWVudC5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | [...c/main/java/org/apache/hudi/table/HoodieTable.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllVGFibGUuamF2YQ==) | `39.53% <0.00%> (-0.63%)` | `22.00 <0.00> (ø)` | |
   | [...e/hudi/table/action/clean/CleanActionExecutor.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NsZWFuL0NsZWFuQWN0aW9uRXhlY3V0b3IuamF2YQ==) | `10.52% <0.00%> (-0.17%)` | `5.00 <0.00> (ø)` | |
   | [...ction/compact/HoodieMergeOnReadTableCompactor.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbXBhY3QvSG9vZGllTWVyZ2VPblJlYWRUYWJsZUNvbXBhY3Rvci5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | [...on/rollback/MergeOnReadRollbackActionExecutor.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL3JvbGxiYWNrL01lcmdlT25SZWFkUm9sbGJhY2tBY3Rpb25FeGVjdXRvci5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | [...che/hudi/table/action/rollback/RollbackHelper.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL3JvbGxiYWNrL1JvbGxiYWNrSGVscGVyLmphdmE=) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | [...able/action/savepoint/SavepointActionExecutor.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL3NhdmVwb2ludC9TYXZlcG9pbnRBY3Rpb25FeGVjdXRvci5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | [...n/java/org/apache/hudi/index/HoodieIndexUtils.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvSG9vZGllSW5kZXhVdGlscy5qYXZh) | `47.61% <100.00%> (+2.61%)` | `3.00 <1.00> (ø)` | |
   | [.../org/apache/hudi/index/bloom/HoodieBloomIndex.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vSG9vZGllQmxvb21JbmRleC5qYXZh) | `59.00% <100.00%> (+0.41%)` | `16.00 <0.00> (ø)` | |
   | [...he/hudi/table/action/commit/UpsertPartitioner.java](https://codecov.io/gh/apache/hudi/pull/1289/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9VcHNlcnRQYXJ0aXRpb25lci5qYXZh) | `55.71% <100.00%> (+0.31%)` | `15.00 <0.00> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/hudi/pull/1289?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/hudi/pull/1289?src=pr&el=footer). Last update [5a0d3f1...ed384e6](https://codecov.io/gh/apache/hudi/pull/1289?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] prashantwason commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-587701211
 
 
   > 1. It seems like you are only covering cases where an RDD is getting created? Is it possible to change the job group between stages?
   
   The setJobGroup() description applies to the Thread and is used until either it is updated or removed. So we need to label each stage and they should show up correctly.
   
   We can label RDD creations as well as operations on them.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#discussion_r375667190
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
 ##########
 @@ -586,6 +586,7 @@ public boolean savepoint(String commitTime, String user, String comment) {
           HoodieTimeline.compareTimestamps(commitTime, lastCommitRetained, HoodieTimeline.GREATER_OR_EQUAL),
           "Could not savepoint commit " + commitTime + " as this is beyond the lookup window " + lastCommitRetained);
 
+      jsc.setJobGroup(this.getClass().getSimpleName(), "Collecting latest files in partition");
 
 Review comment:
   In general lets provide some context into what higher level context, the action is being performed i.e savepoints, compaction, rollbacks. etc . In that spirit, change to `Collecting latest files for savepoint` ? 
   
   Also wonder if we can include the `commitTime` in the detail i.e `Collecting latest files for savepoint 20200205010000`. This way, you can just go to past runs on spark history server and relate them to commits on hudi.. Even better, if someone is running deltastreamer in continuous mode, then they can see activity for commits over time 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [hudi] prashantwason commented on pull request #1289: [WIP] [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on pull request #1289:
URL: https://github.com/apache/hudi/pull/1289#issuecomment-636255594


   @lamber-ken  I have updated the PR and added screenshots from the Spark History Server UI for some of the MOR table operations. Please have a look.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#discussion_r375668377
 
 

 ##########
 File path: hudi-client/src/test/java/org/apache/hudi/HoodieClientTestHarness.java
 ##########
 @@ -107,11 +107,12 @@ protected void initSparkContexts(String appName) {
   }
 
   /**
-   * Initializes the Spark contexts ({@link JavaSparkContext} and {@link SQLContext}) with a default name
-   * <b>TestHoodieClient</b>.
+   * Initializes the Spark contexts ({@link JavaSparkContext} and {@link SQLContext}) 
+   * with a default name matching the name of the class.
    */
   protected void initSparkContexts() {
-    initSparkContexts("TestHoodieClient");
+    String ctxName = this.getClass().getSimpleName() + "#" + testName.getMethodName();
+    initSparkContexts(ctxName);
 
 Review comment:
   can we do this in a single line? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [hudi] vinothchandar merged pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar merged pull request #1289:
URL: https://github.com/apache/hudi/pull/1289


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-610432304
 
 
   TestMergeOnReadTable or TestClientCopyOnWriteStorage etc that will do a full upsert dag for cow and mor are good starting points.. but really, we need to run an upsert with a real job to ensure these values also show up in real deployments

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] prashantwason commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-590515904
 
 
   I dont have them yet. I can run any specific Hudi test already committed
   and quickly get the screenshot if that helps.
   
   On Mon, Feb 24, 2020 at 11:39 AM vinoth chandar <no...@github.com>
   wrote:
   
   > @prashantwason
   > <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_prashantwason&d=DwMCaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=c89AU9T1AVhM4r2Xi3ctZA&m=krc72Vl8fn-BAV9Uw7bypYN5QdpKdBETNBLTRmB0Lxg&s=GjZTgUvDI4lH7V_gFYBDEoTDzLa1ymVz_0l4ErWsUqk&e=>
   > Wondering if you have some screenshots for the upsert dag.. (I can try
   > running tjhe PR locally if not)
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_incubator-2Dhudi_pull_1289-3Femail-5Fsource-3Dnotifications-26email-5Ftoken-3DAN55SS53LAOGY63I2KHAEFTREQPAXA5CNFSM4KM4XRH2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEMZIBDY-23issuecomment-2D590512271&d=DwMCaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=c89AU9T1AVhM4r2Xi3ctZA&m=krc72Vl8fn-BAV9Uw7bypYN5QdpKdBETNBLTRmB0Lxg&s=j_VwpAFSuK81XDMj5bLlrsR-alMRwPkTRoVwEwLiDwE&e=>,
   > or unsubscribe
   > <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AN55SSZKYZC4UBXW474KUALREQPAXANCNFSM4KM4XRHQ&d=DwMCaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=c89AU9T1AVhM4r2Xi3ctZA&m=krc72Vl8fn-BAV9Uw7bypYN5QdpKdBETNBLTRmB0Lxg&s=jAfj9t0L3_KFQ1dOdg7Djlcv5tUbbUe0ootOGzMuBjo&e=>
   > .
   >
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#discussion_r375667688
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/io/compact/HoodieMergeOnReadTableCompactor.java
 ##########
 @@ -94,6 +94,7 @@
         .map(CompactionOperation::convertFromAvroRecordInstance).collect(toList());
     LOG.info("Compactor compacting " + operations + " files");
 
+    jsc.setJobGroup(this.getClass().getSimpleName(), "Compacting files");
 
 Review comment:
   Compacting file slices?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] prashantwason commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
prashantwason commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-579944539
 
 
   A DAG stage name and description can be set using the JavaSparkContext.setJobDescription(...) method. The same name/description is used for all stages which use the same thread until the name/description is updated (another call to setJobDescription) or deleted (clearJobGroup).
   
   In this PR, I am using the ClassName as the stage name and a textual description derived from the method logic. HUDI classes have very descriptive names so this works well.
   
   There are two ways this may be done:
   1. Manually (this PR) by adding code set the name/description before any DAG stages are started. 
   2. Using Java AOP to automatically find code locations matching some pattern and augment them with the call to setJobDescription. 
   
   To use AOP approach, we can create a separate AspectJ file containing the Pointcuts (code locations to augment) and Advices (code to insert). There is a separate AspectJ compiler which at runtime can change the class bytecode to add the Advices. 
   
   Pros of AOP approach:
   1. Does not require any change in current code
   2. Also covers future code automatically
   3. Easy to undo (just don't run the AspectJ compiler as part of build)
   4. Can be extended to more use-cases like automating Metrics.
   
   Cons of AOP approach:
   1. Require AspectJ and its compiler to be integrated with the HUDI build chain
   2. The Advice cannot be dynamic. Hence we cannot provide descriptions to the DAG stages (we can still use the class name as the DAG stage name). 
   
   Since the code has a manageable number of places where DAG is created, I prefer the simpler manual approach. It also ends up documenting the code.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1289:
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-618015542


   @prashantwason this will be a good candidate to fast track into the next release. are you still working on this? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org