You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/01/29 16:00:40 UTC

[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

lamber-ken opened a new pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290
 
 
   ## What is the purpose of the pull request
   
   From [spark-avro-guide](http://spark.apache.org/docs/latest/sql-data-sources-avro.html), we know that the spark-avro module is external, it is not exists in [spark-2.4.4-bin-hadoop2.7.tgz](http://mirror.bit.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz).
   
   **Shade spark-avro:**
   - User will not use `--packages org.apache.spark:spark-avro_2.11:2.4.4`
   - User also don't need care about using the `scala-2.11` or `scala-2.12`
   
   ## Brief change log
   
     - *Relocate spark-avro dependency by maven-shade-plugin*
   
   ## Verify this pull request
   
   ```
   export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
   
   ${SPARK_HOME}/bin/spark-shell \
       --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
       --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   val basePath = "file:///tmp/hudi_mor_table"
   
   var datas = List("""{ "name": "kenken", "ts": 1574297893836, "age": 12, "location": "latitude"}""")
   val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))
   df.write.format("org.apache.hudi").
       option("hoodie.datasource.write.recordkey.field", "name").
       option("hoodie.datasource.write.partitionpath.field", "location").
       option("hoodie.datasource.write.precombine.field", "ts").
       option("hoodie.table.name", "tableName").
       mode("Overwrite").
       save(basePath)
   
   spark.read.format("org.apache.hudi").load(basePath + "/*/").show()
   ```
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#discussion_r372707430
 
 

 ##########
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
 ##########
 @@ -104,7 +104,7 @@ public String showLogFileCommits(
           try {
             instantTime = n.getLogBlockHeader().get(HeaderMetadataType.INSTANT_TIME);
             if (instantTime == null) {
-              throw new Exception("Invalid instant time " + instantTime);
 
 Review comment:
   Right, I agree it's a minor change and doesn't need a new PR. But I don't think we really need this change

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken edited a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-593252380
 
 
   hi @bhasudha, addressed and add `--packages` to READ.ME doc.   Thanks 👍 
   ```
   spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
     --packages org.apache.spark:spark-avro_2.11:2.4.4 \
     --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bhasudha commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-582654435
 
 
   > > If the user wants some changes in spark-avro:3.0-preview2, the right way is modify spark-version to 3.0-preview2 at pom.xml file, then build hudi project source.
   > 
   > Understood... but we would like to avoid this need for building a version of hudi by themselves. Users should be able to do `--packages <hudi-spark-bundle*.jar>` and be on their way.. This change forces everyone who is not on the spark version used by hudi to do their own builds.. I still feel this is not a good idea.
   > 
   > Once again, I wish we can have more upfront discussion on the JIRA, rather than on a PR, around issues like this. (I feel bad when someone puts in the work to do the implementation that we cannot take :()
   
   +1.  Good catch. @lamber-ken I think Vinoth brings up a valid point. Although your PR intends to make it easier for users to not care about scala 2.11 or scala 2.12, we also need to avoid coupling Hudi with specific spark_avro versions be it 2.4.4 or 3.0-preview2.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#discussion_r372707430
 
 

 ##########
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
 ##########
 @@ -104,7 +104,7 @@ public String showLogFileCommits(
           try {
             instantTime = n.getLogBlockHeader().get(HeaderMetadataType.INSTANT_TIME);
             if (instantTime == null) {
-              throw new Exception("Invalid instant time " + instantTime);
 
 Review comment:
   Right, I agree it's a minor change and doesn't need a new PR. But I don't think we need this change

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-582663093
 
 
   Hi @vinothchandar @bhasudha @umehrot2, thanks you all talk about this here, I thinks it's a meaningful discusstion. I wrote a email to talk about this, more discussion about this, please using email, thanks. :)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken edited a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-582669261
 
 
   Hi @vinothchandar, sorry, I didn't know you and @umehrot2 had talked about this before. 😢  
   https://github.com/apache/incubator-hudi/pull/1005#pullrequestreview-340275874

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-593252380
 
 
   hi @bhasudha, addressed and add `--packages` to READ.ME doc
   ```
   export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
     --packages org.apache.spark:spark-avro_2.11:2.4.4 \
     --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-593244136
 
 
   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1290?src=pr&el=h1) Report
   > :exclamation: No coverage uploaded for pull request base (`master@078d482`). [Click here to learn what that means](https://docs.codecov.io/docs/error-reference#section-missing-base-commit).
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1290/graphs/tree.svg?width=650&token=VTTXabwbs2&height=150&src=pr)](https://codecov.io/gh/apache/incubator-hudi/pull/1290?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff            @@
   ##             master    #1290   +/-   ##
   =========================================
     Coverage          ?   67.09%           
     Complexity        ?      223           
   =========================================
     Files             ?      333           
     Lines             ?    16216           
     Branches          ?     1659           
   =========================================
     Hits              ?    10880           
     Misses            ?     4598           
     Partials          ?      738
   ```
   
   
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1290?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1290?src=pr&el=footer). Last update [078d482...7be999a](https://codecov.io/gh/apache/incubator-hudi/pull/1290?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-582248349
 
 
   Thanks for review this pr @vinothchandar, this pr does not lock the user into `spark-avro:2.4.4`. 
   
   If the user wants some changes in `spark-avro:3.0-preview2`, the right way is modify spark-version to `3.0-preview2` at pom.xml file, then build hudi project source.
   
   https://github.com/apache/incubator-hudi/blob/4de0fcfcb54ac76ed3b6852917588c32fec9bea8/pom.xml#L95
   
   ```
   <dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-avro_${scala.binary.version}</artifactId>
     <version>${spark.version}</version>
     <scope>provided</scope>
   </dependency>
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken removed a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken removed a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-593237472
 
 
   hi @bhasudha, all review comments are addressed and fixed. Thanks
   
   ![image](https://user-images.githubusercontent.com/20113411/75649979-9fd77e00-5c8f-11ea-907d-51774493c050.png)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-593244136
 
 
   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1290?src=pr&el=h1) Report
   > :exclamation: No coverage uploaded for pull request base (`master@078d482`). [Click here to learn what that means](https://docs.codecov.io/docs/error-reference#section-missing-base-commit).
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1290/graphs/tree.svg?width=650&token=VTTXabwbs2&height=150&src=pr)](https://codecov.io/gh/apache/incubator-hudi/pull/1290?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff            @@
   ##             master    #1290   +/-   ##
   =========================================
     Coverage          ?   67.09%           
     Complexity        ?      223           
   =========================================
     Files             ?      333           
     Lines             ?    16216           
     Branches          ?     1659           
   =========================================
     Hits              ?    10880           
     Misses            ?     4598           
     Partials          ?      738
   ```
   
   
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1290?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1290?src=pr&el=footer). Last update [078d482...ac87b91](https://codecov.io/gh/apache/incubator-hudi/pull/1290?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#discussion_r372577925
 
 

 ##########
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
 ##########
 @@ -104,7 +104,7 @@ public String showLogFileCommits(
           try {
             instantTime = n.getLogBlockHeader().get(HeaderMetadataType.INSTANT_TIME);
             if (instantTime == null) {
-              throw new Exception("Invalid instant time " + instantTime);
 
 Review comment:
   just leave it as instantTime ? (also this change is different from what the PR says so lets not have it in this PR)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken edited a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-582628959
 
 
   > > If the user wants some changes in spark-avro:3.0-preview2, the right way is modify spark-version to 3.0-preview2 at pom.xml file, then build hudi project source.
   > 
   > Understood... but we would like to avoid this need for building a version of hudi by themselves. Users should be able to do `--packages <hudi-spark-bundle*.jar>` and be on their way.. This change forces everyone who is not on the spark version used by hudi to do their own builds.. I still feel this is not a good idea.
   > 
   > Once again, I wish we can have more upfront discussion on the JIRA, rather than on a PR, around issues like this. (I feel bad when someone puts in the work to do the implementation that we cannot take :()
   
   Thanks, got it. Welcome, any suggestion is good to me. will file a disscuss email about this. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-593237472
 
 
   hi @bhasudha, all review comments are addressed and fixed. Thanks
   
   ![image](https://user-images.githubusercontent.com/20113411/75649979-9fd77e00-5c8f-11ea-907d-51774493c050.png)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
bhasudha commented on a change in pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#discussion_r386190496
 
 

 ##########
 File path: packaging/hudi-spark-bundle/pom.xml
 ##########
 @@ -248,6 +260,13 @@
         <spark.bundle.hive.shade.prefix>org.apache.hudi.</spark.bundle.hive.shade.prefix>
       </properties>
     </profile>
+    <profile>
+      <id>spark-shade-unbundle-avro</id>
+      <properties>
+        <spark.bundle.avro.scope>provided</spark.bundle.avro.scope>
+        <spark.bundle.spark.shade.prefix>/</spark.bundle.spark.shade.prefix>
 
 Review comment:
   @lamber-ken I see a '/' here for the property "spark.bundle.spark.shade.prefix" if the profile is activated. Is this intended? Does this mean shade with empty prefix (same as not shading)? I dont know much about profiling either so wanted to clarify with you.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on a change in pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#discussion_r372820628
 
 

 ##########
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
 ##########
 @@ -104,7 +104,7 @@ public String showLogFileCommits(
           try {
             instantTime = n.getLogBlockHeader().get(HeaderMetadataType.INSTANT_TIME);
             if (instantTime == null) {
-              throw new Exception("Invalid instant time " + instantTime);
 
 Review comment:
   @n3nash follow you, reverted.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] umehrot2 edited a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
umehrot2 edited a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-582650497
 
 
   As @vinothchandar mentioned we did discuss the possibility of doing this, but were thrown back because of the reason mentioned above by Vinoth.
   
   But just thinking out aloud, now that Hudi has been migrated to `spark 2.4.4` is it really recommended for user's using older or newer versions of spark version to be simply dropping the pre-built hudi jars with spark 2.4.4 ? Shouldn't the recommendation for such users be to actually build their own hudi jars, with the spark version they use. This would help them catch any possible compatibility issues during compile time. I am not sure how safe it is for us to claim that user's use our pre-built jars regardless of spark versions.
   
   And if building own jars is the recommendation, then i think this change can be pulled in.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bhasudha merged pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
bhasudha merged pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bhasudha commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-580007687
 
 
   Thanks @lamber-ken This is very useful. Couple of requests. 
   - Were you able to verify that this works against spark-3.0-preview2 and scala 2.12 (just for double confirmation)?
   - Also this would need documentation changes especially quickstart page and may be docker-demo pages? Can you also track that as a subtask of HUDI-584 or may be a new Jira if needed? 
   
   LGTM otherwise.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on a change in pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#discussion_r386211419
 
 

 ##########
 File path: packaging/hudi-spark-bundle/pom.xml
 ##########
 @@ -248,6 +260,13 @@
         <spark.bundle.hive.shade.prefix>org.apache.hudi.</spark.bundle.hive.shade.prefix>
       </properties>
     </profile>
+    <profile>
+      <id>spark-shade-unbundle-avro</id>
+      <properties>
+        <spark.bundle.avro.scope>provided</spark.bundle.avro.scope>
+        <spark.bundle.spark.shade.prefix>/</spark.bundle.spark.shade.prefix>
 
 Review comment:
   Good catch 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken edited a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-582669261
 
 
   Hi @vinothchandar, I didn't know you and @umehrot2 had talked about this before. 😢  
   https://github.com/apache/incubator-hudi/pull/1005#pullrequestreview-340275874

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] umehrot2 commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-582650497
 
 
   As @vinothchandar mentioned we did discuss the possibility of doing this, but were thrown back because of the reason mentioned above by Vinoth.
   
   But just thinking out aloud, now that Hudi has been migrated to `spark 2.4.4` is it really recommended for user's using older or newer versions of spark version to be simply dropping the pre-built hudi jars with spark 2.4.4 ? Shouldn't the recommendation for such users be to actually build their own hudi jars, with the spark version they use. This would help them catch any possible compatibility issues during compile time. I am not sure how safe it is for us to claim that user's use for our pre-built jars regardless of spark versions.
   
   And if building own jars is the recommendation, then i think this change can be pulled in.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-582238771
 
 
   @umehrot2 as well.. We considered doing this when we moved to `spark-avro` actually. The concern was that the bundle now has a dependency on the spark version used by Hudi.. i.e we use spark 2.4.4 and the spark bundle would contain `spark-avro:2.4.4`. 
   
   I believe this may be working with `spark-3.0-preview2` now. But what if the user wants some changes in `spark-avro:3.0-preview2` (esp with avro <=> row datatype conversions that keep coming up)? They would have to make a custom build right?  
   
   In short, I don't know if this is a good idea..  
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-582661108
 
 
   hi @bhasudha, `Spark 3.0.0-preview2` is pre-built with Scala 2.12, user need to build hudi with scala-2.12 first. You can download spark-preview package from [here](https://www.apache.org/dyn/closer.lua/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz).
   
   ![image](https://user-images.githubusercontent.com/20113411/73892288-78201080-48b1-11ea-8a68-ff4473640694.png)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-580144740
 
 
   hi @bhasudha, you're welcome. This works well with ` spark-3.0-preview2 and scala 2.12`.
   
   ```
   wget http://mirror.bit.edu.cn/apache/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
   
   
   dev/change-scala-version.sh 2.12
   mvn clean package -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true
   
   export SPARK_HOME=/work/BigData/install/spark/spark-3.0.0-preview2-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
       --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
       --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   val basePath = "file:///tmp/hudi_mor_table"
   
   var datas = List("""{ "name": "kenken", "ts": 1574297893836, "age": 12, "location": "latitude"}""")
   val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))
   df.write.format("org.apache.hudi").
       option("hoodie.datasource.write.recordkey.field", "name").
       option("hoodie.datasource.write.partitionpath.field", "location").
       option("hoodie.datasource.write.precombine.field", "ts").
       option("hoodie.table.name", "tableName").
       mode("Overwrite").
       save(basePath)
   
   
   spark.read.format("org.apache.hudi").load(basePath + "/*/").show()
   ```
   
   Right, I think it needs an new jira as a subtask of HUDI-584.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-582628959
 
 
   > > If the user wants some changes in spark-avro:3.0-preview2, the right way is modify spark-version to 3.0-preview2 at pom.xml file, then build hudi project source.
   > 
   > Understood... but we would like to avoid this need for building a version of hudi by themselves. Users should be able to do `--packages <hudi-spark-bundle*.jar>` and be on their way.. This change forces everyone who is not on the spark version used by hudi to do their own builds.. I still feel this is not a good idea.
   > 
   > Once again, I wish we can have more upfront discussion on the JIRA, rather than on a PR, around issues like this. (I feel bad when someone puts in the work to do the implementation that we cannot take :()
   
   Thanks, got it. will file a disscuss email about this. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
bhasudha commented on a change in pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#discussion_r386190041
 
 

 ##########
 File path: README.md
 ##########
 @@ -71,6 +71,14 @@ The default Scala version supported is 2.11. To build for Scala 2.12 version, bu
 mvn clean package -DskipTests -DskipITs -Dscala-2.12
 ```
 
+### Build without spark-avro module
+
+The default hudi-jar bundles spark-avro module. To build without spark-avro module, build using `spark-shade-unbundle-avro` profile
 
 Review comment:
   can we also add the `--package`  instruction here to explicitly say how to add spark-avro if building without it. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-582669261
 
 
   Hi @vinothchandar, I didn't know you and @umehrot2 had talked about this 😭 
   https://github.com/apache/incubator-hudi/pull/1005#pullrequestreview-340275874

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken edited a comment on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-593252380
 
 
   hi @bhasudha, addressed and add `--packages` to READ.ME doc
   ```
   spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
     --packages org.apache.spark:spark-avro_2.11:2.4.4 \
     --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-582609037
 
 
   >If the user wants some changes in spark-avro:3.0-preview2, the right way is modify spark-version to 3.0-preview2 at pom.xml file, then build hudi project source.
   
   Understood... but we would like to avoid this need for building a version of hudi by themselves. Users should be able to do `--packages <hudi-spark-bundle*.jar>` and be on their way.. This change forces everyone who is not on the spark version used by hudi to do their own builds.. I still feel this is not a good idea. 
   
   Once again, I wish we can have more upfront discussion on the JIRA, rather than on a PR, around issues like this. (I feel bad when someone puts in the work to do the implementation that we cannot take :() 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bhasudha commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-594751757
 
 
   LGTM. I think this is in good shape to be merged. @vinothchandar Can you review the changes quickly?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on a change in pull request #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#discussion_r372605766
 
 

 ##########
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
 ##########
 @@ -104,7 +104,7 @@ public String showLogFileCommits(
           try {
             instantTime = n.getLogBlockHeader().get(HeaderMetadataType.INSTANT_TIME);
             if (instantTime == null) {
-              throw new Exception("Invalid instant time " + instantTime);
 
 Review comment:
   @n3nash thanks for your comment, IMO, it's a minor change (which has no function impact) may be no need separate PR. 
   
   For the minor change: The `instantTime` variable is known to be null.
   ```
   "Invalid instant time " + null
   ```
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] umehrot2 commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-582655960
 
 
   Also another thing to consider here, is `spark-avro` module is only used for converting between spark's struct type and avro schema. While all the code for actual conversion of the data, is maintained inside of Hudi.
   
   I recently fixed an issue with this schema conversion https://github.com/apache/incubator-hudi/pull/1223/files#diff-3c046573a91f36ba0f12dad0e3395dc9R346 where since the way spark-avro 2.4.4 created avro namespaces was different than earlier databricks avro. And this resulted in a change in Hudi's code as well. Considering this, it might be good to actually fix the spark-avro version and relocate it, to avoid customer's running into such issues as this would fix the schema conversion logic being used.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] codecov-io commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin

Posted by GitBox <gi...@apache.org>.
codecov-io commented on issue #1290: [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin
URL: https://github.com/apache/incubator-hudi/pull/1290#issuecomment-593244136
 
 
   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1290?src=pr&el=h1) Report
   > :exclamation: No coverage uploaded for pull request base (`master@078d482`). [Click here to learn what that means](https://docs.codecov.io/docs/error-reference#section-missing-base-commit).
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1290/graphs/tree.svg?width=650&token=VTTXabwbs2&height=150&src=pr)](https://codecov.io/gh/apache/incubator-hudi/pull/1290?src=pr&el=tree)
   
   ```diff
   @@           Coverage Diff            @@
   ##             master   #1290   +/-   ##
   ========================================
     Coverage          ?   0.64%           
     Complexity        ?       2           
   ========================================
     Files             ?     287           
     Lines             ?   14319           
     Branches          ?    1465           
   ========================================
     Hits              ?      92           
     Misses            ?   14224           
     Partials          ?       3
   ```
   
   
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1290?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1290?src=pr&el=footer). Last update [078d482...ac87b91](https://codecov.io/gh/apache/incubator-hudi/pull/1290?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services