Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/05/20 15:22:19 UTC

[GitHub] [hudi] leobiscassi opened a new pull request, #5647: Update the docs for Distributed File System (DFS) section on Hudi Delta Streamer page

leobiscassi opened a new pull request, #5647:
URL: https://github.com/apache/hudi/pull/5647

   ## What is the purpose of the pull request
   
   This PR improves the Hudi Delta Streamer docs for DFS sources in two ways:
   
   - It documents the expected behavior around partitions, which @yihua explained to me in [this](https://github.com/apache/hudi/issues/5485) issue.
   - It documents the need to set schemas for JSON files, which was not clear in the docs; I had to ask on the Slack channel.
   
   ## Brief change log
   
     - *Modify the Distributed File System (DFS) subsection under the Sources section. At this moment the changes do not affect the versioned docs.*
   
   ## Verify this pull request
   
   This pull request is a trivial improvement in the docs without any test coverage.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] leobiscassi commented on pull request #5647: Update the docs for Distributed File System (DFS) section on Hudi Delta Streamer page

Posted by GitBox <gi...@apache.org>.
leobiscassi commented on PR #5647:
URL: https://github.com/apache/hudi/pull/5647#issuecomment-1151758686

   Thanks @nsivabalan! 🎉 




[GitHub] [hudi] leobiscassi commented on pull request #5647: Update the docs for Distributed File System (DFS) section on Hudi Delta Streamer page

Posted by GitBox <gi...@apache.org>.
leobiscassi commented on PR #5647:
URL: https://github.com/apache/hudi/pull/5647#issuecomment-1133033707

   @yihua @bhasudha this PR is related to #5485 




[GitHub] [hudi] xushiyan commented on a diff in pull request #5647: Update the docs for Distributed File System (DFS) section on Hudi Delta Streamer page

Posted by GitBox <gi...@apache.org>.
xushiyan commented on code in PR #5647:
URL: https://github.com/apache/hudi/pull/5647#discussion_r878806229


##########
website/docs/hoodie_deltastreamer.md:
##########
@@ -303,6 +303,11 @@ other formats and then write data as Hudi format.)
 - ORC
 - HUDI
 
+For DFS sources the following behaviors are expected:
+
+- For JSON file format you always need to inform a schema. If the target hudi table follows the same schema from the source file, you just need to inform the schema for source, if don't you need to inform schemas for both. 

Review Comment:
   ```suggestion
   - For a JSON DFS source, you always need to set a schema. If the target Hudi table follows the same schema as the source file, you just need to set the source schema. If not, you need to set schemas for both source and target. 
   ```
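
A minimal sketch of what the suggestion above means in practice, assuming the file-based schema provider is used. The two `schemaprovider` keys are the ones I believe `FilebasedSchemaProvider` reads; they do not appear in this thread, so check them against your Hudi version. The schema, paths, and helper names (`build_props`, `render_props`) are hypothetical and only illustrate the two cases: source schema alone, or source plus target.

```python
import json

# Hypothetical Avro record schema describing the JSON source files.
source_schema = {
    "type": "record",
    "name": "Event",
    "namespace": "example",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
}


def build_props(root, source_schema_file, target_schema_file=None):
    """Return DeltaStreamer properties for a JSON DFS source.

    The target schema entry is emitted only when a separate file is given,
    i.e. when the target table does not share the source schema.
    """
    props = {
        "hoodie.deltastreamer.source.dfs.root": root,
        # Assumed key name; verify against your Hudi version.
        "hoodie.deltastreamer.schemaprovider.source.schema.file": source_schema_file,
    }
    if target_schema_file is not None:
        # Assumed key name; verify against your Hudi version.
        props["hoodie.deltastreamer.schemaprovider.target.schema.file"] = target_schema_file
    return props


def render_props(props):
    """Render the properties as key=value lines for a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


# Contents of the .avsc file that the source schema key would point to.
schema_text = json.dumps(source_schema, indent=2)

# Case 1: target table uses the same schema as the source files.
same_schema = build_props("/data/input/events", "/data/schemas/source.avsc")

# Case 2: target table uses a different schema, so both are set.
different_schema = build_props(
    "/data/input/events",
    "/data/schemas/source.avsc",
    "/data/schemas/target.avsc",
)

print(render_props(same_schema))
print(render_props(different_schema))
```

The rendered lines would go into the properties file passed to the tool, and `schema_text` into the `.avsc` file the source key points to.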





[GitHub] [hudi] nsivabalan commented on pull request #5647: Update the docs for Distributed File System (DFS) section on Hudi Delta Streamer page

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #5647:
URL: https://github.com/apache/hudi/pull/5647#issuecomment-1151705717

   thanks for the contribution!




[GitHub] [hudi] nsivabalan merged pull request #5647: Update the docs for Distributed File System (DFS) section on Hudi Delta Streamer page

Posted by GitBox <gi...@apache.org>.
nsivabalan merged PR #5647:
URL: https://github.com/apache/hudi/pull/5647




[GitHub] [hudi] leobiscassi commented on pull request #5647: Update the docs for Distributed File System (DFS) section on Hudi Delta Streamer page

Posted by GitBox <gi...@apache.org>.
leobiscassi commented on PR #5647:
URL: https://github.com/apache/hudi/pull/5647#issuecomment-1149079584

   @xushiyan just curious, will this be merged?




[GitHub] [hudi] leobiscassi commented on pull request #5647: Update the docs for Distributed File System (DFS) section on Hudi Delta Streamer page

Posted by GitBox <gi...@apache.org>.
leobiscassi commented on PR #5647:
URL: https://github.com/apache/hudi/pull/5647#issuecomment-1133914368

   @xushiyan thanks for the review and suggestions 👍🏾 




[GitHub] [hudi] xushiyan commented on pull request #5647: Update the docs for Distributed File System (DFS) section on Hudi Delta Streamer page

Posted by GitBox <gi...@apache.org>.
xushiyan commented on PR #5647:
URL: https://github.com/apache/hudi/pull/5647#issuecomment-1134028467

   > @xushiyan thanks for the review and suggestions 👍🏾 Do you think it makes sense to add this to the versioned docs too, since the same behavior is expected? At least since 0.8.0, which is the version where I started using Hudi.
   
   @leobiscassi sure. it definitely helps to backfill the versioned docs. thanks




[GitHub] [hudi] xushiyan commented on a diff in pull request #5647: Update the docs for Distributed File System (DFS) section on Hudi Delta Streamer page

Posted by GitBox <gi...@apache.org>.
xushiyan commented on code in PR #5647:
URL: https://github.com/apache/hudi/pull/5647#discussion_r878807563


##########
website/docs/hoodie_deltastreamer.md:
##########
@@ -303,6 +303,11 @@ other formats and then write data as Hudi format.)
 - ORC
 - HUDI
 
+For DFS sources the following behaviors are expected:
+
+- For JSON file format you always need to inform a schema. If the target hudi table follows the same schema from the source file, you just need to inform the schema for source, if don't you need to inform schemas for both. 
+- `HoodieDeltaStreamer` reads the files under the source path (`hoodie.deltastreamer.source.dfs.root`) directly, so you should not expect the tool to recognize partitions under this path as fields of the dataset. Detailed examples can be found [here](https://github.com/apache/hudi/issues/5485).

Review Comment:
   ```suggestion
   - `HoodieDeltaStreamer` reads the files under the source base path (`hoodie.deltastreamer.source.dfs.root`) directly, and it won't use the partition paths under this base path as fields of the dataset. Detailed examples can be found [here](https://github.com/apache/hudi/issues/5485).
   ```
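
To illustrate the behavior described in the suggestion above, here is a small sketch in plain Python (not Hudi code). It assumes a Hive-style `key=value` directory layout under the source base path, which is my assumption and not something this thread specifies. Reading a file directly yields only the fields stored in the file; the partition values live in the path and would have to be added to the records upstream if they are needed as fields. The helper names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def partition_values_from_path(root: Path, file_path: Path) -> dict:
    """Collect Hive-style key=value segments between root and the file."""
    values = {}
    for segment in file_path.relative_to(root).parent.parts:
        key, sep, value = segment.partition("=")
        if sep:  # skip directories that are not key=value
            values[key] = value
    return values


def read_records(file_path: Path) -> list:
    """Read newline-delimited JSON records exactly as stored in the file."""
    with file_path.open() as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical source base path with two partition levels.
    root = Path(tmp) / "source_root"
    data_file = root / "year=2022" / "month=05" / "events.json"
    data_file.parent.mkdir(parents=True)
    data_file.write_text('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')

    # What a direct file read sees: no "year" or "month" fields.
    raw = read_records(data_file)

    # One way to carry the partition values into the records before ingestion.
    partition_values = partition_values_from_path(root, data_file)
    enriched = [{**record, **partition_values} for record in raw]

print(raw)
print(enriched)
```

Note that values recovered from the path are strings, so a preprocessing step like this would also need to cast them to whatever types the schema declares.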


