You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/01/11 14:40:21 UTC

[GitHub] [hudi] loukey-lj opened a new pull request #2433: Hudi 1511

loukey-lj opened a new pull request #2433:
URL: https://github.com/apache/hudi/pull/2433


   InstantGenerateOperator support multiple parallelism.
   When InstantGenerateOperator subtask size greater than 1 we can set subtask 0 as a main subtask, only main task create new instant.
   The prerequisite of create new instant is exist subtask received data in current checkpoint. Every subtask will create a tmp file,
   flie name is make up by checkpointid,subtask index and received records size. 
   The main subtask will check every subtask file and parse file to make sure is shuold to create new instant. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism

Posted by GitBox <gi...@apache.org>.
danny0405 commented on pull request #2433:
URL: https://github.com/apache/hudi/pull/2433#issuecomment-758372866


   The file check of each task is useless because even if a task of the source has no data for some time interval, the checkpoint still can trigger normally. So all task checkpoint successfully does not mean there is data. (I have solved this in mu #step1 PR though)
   
   There is no need to checkpoint the write status in KeyedWriteProcessOperator, because we can not start a new instant if the last instant failes, the more proper/simple way is to retry the commit actions several times and trigger failover if still fails.
   
   BTW, IMO, we should finish RFC-24 first as fast as possible, it sloves many bugs and has many improvements. After that i would add a compatible pipeline and this PR can apply there, and i can help to review.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] yanghua commented on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism

Posted by GitBox <gi...@apache.org>.
yanghua commented on pull request #2433:
URL: https://github.com/apache/hudi/pull/2433#issuecomment-758474842


   > The file check of each task is useless because even if a task of the source has no data for some time interval, the checkpoint still can trigger normally. So all task checkpoint successfully does not mean there is data. (I have solved this in mu #step1 PR though)
   
   That's true. We could remove the file check logic. While the implementation of multiple parallelisms is reasonable.
   
   > There is no need to checkpoint the write status in KeyedWriteProcessOperator, because we can not start a new instant if the last instant failes, the more proper/simple way is to retry the commit actions several times and trigger failover if still fails.
   
   Sounds reasonable, agree.
   
   > BTW, IMO, we should finish RFC-24 first as fast as possible, it sloves many bugs and has many improvements. After that i would add a compatible pipeline and this PR can apply there, and i can help to review.
   
   I personally think that a better form of community participation is:
   
   1) Control the granularity of changes;
   2) Each submission is a complete function point, so that the working behavior of the code does not change;
   
   I actually want to know if you have removed the compatible version of `OperatorCoordinator`. What is the design? Will it be better than the current one? Will it be better than this PR design?
   
   All this is opaque.
   
   Your Step 1 is a large-scale refactoring, and the merged code will make this client immediately unavailable. It is currently in a critical period before the release of 0.7 (if we do not have the energy to merge step 2, 3 in the short term?). Why can't we optimize it step by step?
   
   In fact, the first and second step you need to optimize is File Assigner. SF Express has already implemented it and is ready to provide PR. In fact, we improve on the existing basis, and risks and changes are controllable, right? I think that in the end, we provide a more "elegant implementation" of OperatorCoordinator for the higher version, which is the correct order.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] wangxianghu commented on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism

Posted by GitBox <gi...@apache.org>.
wangxianghu commented on pull request #2433:
URL: https://github.com/apache/hudi/pull/2433#issuecomment-758502012


   @yanghua @danny0405 @loukey-lj  This PR has been recreated and can't be reopen now,
   please move to https://github.com/apache/hudi/pull/2434 and communicate there
   Thanks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] loukey-lj closed pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism

Posted by GitBox <gi...@apache.org>.
loukey-lj closed pull request #2433:
URL: https://github.com/apache/hudi/pull/2433


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism

Posted by GitBox <gi...@apache.org>.
danny0405 commented on pull request #2433:
URL: https://github.com/apache/hudi/pull/2433#issuecomment-758486505


   > That's true. We could remove the file check logic. While the implementation of multiple parallelisms is reasonable.
   
   Don't think so, the implementation relies on the checkpoint to start a instant, it does not work if the checkpoint data buffer is huge.
   
   > I actually want to know if you have the idea of the compatible version of OperatorCoordinator. What is the design?
   
   Already updated the RFC-24 WIKI, at least from the design, it is more reasonable.
   
   > Your Step 1 is a large-scale refactoring
   
   I have no choice, because the original code is far away from production ready, in order to make it robust enough, i have to make big changes, and i already split it into 4 step, if you have no time to review, i would ask others to help, thanks though ~
   
   >  It is currently in a critical period before the release of 0.7
   
   I don't think this PR or mine can be merged in the release 0.7, release 0.7 is already in RC, all the refactoring should be done before 0.8 is released, so at least, we do not break the compatibility of the release version. Yes, there may be some time that the lower version does not work, but the final release is ok.
   
   > In fact, the first and second step you need to optimize is File Assigner. SF Express has already implemented it and is ready to provide PR. 
   
   Maybe they have a PR, but IMO, the PR does not have high code quality, there are no test frameworks even. No one can tell if the code works correctly or suitable for big data set.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] yanghua edited a comment on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism

Posted by GitBox <gi...@apache.org>.
yanghua edited a comment on pull request #2433:
URL: https://github.com/apache/hudi/pull/2433#issuecomment-758474842


   > The file check of each task is useless because even if a task of the source has no data for some time interval, the checkpoint still can trigger normally. So all task checkpoint successfully does not mean there is data. (I have solved this in mu #step1 PR though)
   
   That's true. We could remove the file check logic. While the implementation of multiple parallelisms is reasonable.
   
   > There is no need to checkpoint the write status in KeyedWriteProcessOperator, because we can not start a new instant if the last instant failes, the more proper/simple way is to retry the commit actions several times and trigger failover if still fails.
   
   Sounds reasonable, agree.
   
   > BTW, IMO, we should finish RFC-24 first as fast as possible, it sloves many bugs and has many improvements. After that i would add a compatible pipeline and this PR can apply there, and i can help to review.
   
   I personally think that a better form of community participation is:
   
   1) Control the granularity of changes;
   2) Each submission is a complete function point, so that the working behavior of the code does not change;
   
   I actually want to know if you have the idea of the compatible version of `OperatorCoordinator`. What is the design? Will it be better than the current one? Will it be better than this PR design?
   
   All this is opaque.
   
   Your Step 1 is a large-scale refactoring, and the merged code will make this client immediately unavailable. It is currently in a critical period before the release of 0.7 (if we do not have the energy to merge step 2, 3 in the short term?). Why can't we optimize it step by step?
   
   In fact, the first and second step you need to optimize is File Assigner. SF Express has already implemented it and is ready to provide PR. In fact, we improve on the existing basis, and risks and changes are controllable, right? I think that in the end, we provide a more "elegant implementation" of OperatorCoordinator for the higher version, which is the correct order.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] yanghua commented on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism

Posted by GitBox <gi...@apache.org>.
yanghua commented on pull request #2433:
URL: https://github.com/apache/hudi/pull/2433#issuecomment-758327239


   @danny0405 wdyt about this optimization?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] yanghua commented on pull request #2433: [HUDI-1511] InstantGenerateOperator support multiple parallelism

Posted by GitBox <gi...@apache.org>.
yanghua commented on pull request #2433:
URL: https://github.com/apache/hudi/pull/2433#issuecomment-758321935


   @loukey-lj thanks for your contribution! Can you please:
   
   1) Fix the Travis issue? It's red now;
   2) Update the RFC-13 and describe your optimization.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org