You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hawq.apache.org by liming01 <gi...@git.apache.org> on 2017/02/23 05:42:20 UTC

[GitHub] incubator-hawq pull request #1141: HAWQ-1342. Fixed QE process hang in share...

GitHub user liming01 opened a pull request:

    https://github.com/apache/incubator-hawq/pull/1141

    HAWQ-1342. Fixed QE process hang in shared input scan on segment node

    The basic idea for this kinds of hung problem is to:
    (1) The error thrown segment will invoke rollback the whole transaction, and all related fd will be closed during transaction end.
    (2) The other segment just act as before, when wait for select(), it will loop until the specific fd is closed, then the code will run until process interrupt (the rollback transaction will send cancel signal) again in other place afterward.
    
    So some previous fix (HAWQ-166,  HAWQ-1282) will be changed accordingly.
    (1) HAWQ-166: we don't need to skip sending info
    (2) HAWQ-1282:
      - we don't need to close the fd, it will be closed automatically during transaction end.
      - we just end loop if we find the related FD has already been closed.
    
    Signed-off-by: Amy Bai <ab...@pivotal.io>

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liming01/incubator-hawq mli/HAWQ-1342

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hawq/pull/1141.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1141
    
----
commit a87d52a0671a7a794c3be84eecea14591625eba6
Author: Ming Li <ml...@apache.org>
Date:   2017-02-23T05:22:56Z

    HAWQ-1342. Fixed QE process hang in shared input scan on segment node
    
    The basic idea for this kinds of hung problem is to:
    (1) The error thrown segment will invoke rollback the whole transaction, and all related fd will be closed during transaction end.
    (2) The other segment just act as before, when wait for select(), it will loop until the specific fd is closed, then the code will run until process interrupt (the rollback transaction will send cancel signal) again in other place afterward.
    
    So some previous fix (HAWQ-166,  HAWQ-1282) will be changed accordingly.
    
    Signed-off-by: Amy Bai <ab...@pivotal.io>

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq issue #1141: HAWQ-1342. Fixed QE process hang in shared input...

Posted by zhangh43 <gi...@git.apache.org>.

Github user zhangh43 commented on the issue:

    https://github.com/apache/incubator-hawq/pull/1141
  
    +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq pull request #1141: HAWQ-1342. Fixed QE process hang in share...

Posted by paul-guo- <gi...@git.apache.org>.

Github user paul-guo- commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/1141#discussion_r102657361
  
    --- Diff: src/backend/executor/nodeShareInputScan.c ---
    @@ -925,9 +923,12 @@ writer_wait_for_acks(ShareInput_Lk_Context *pctxt, int share_id, int xslice)
     			int save_errno = errno;
     			elog(LOG, "SISC WRITER (shareid=%d, slice=%d): notify still wait for an answer, errno %d",
     					share_id, currentSliceId, save_errno);
    -			/*if error(except EINTR) happens in select, we just return to avoid endless loop*/
    -			if(errno != EINTR){
    -				return;
    +			if(save_errno == EBADF)
    +			{
    +				/* The file description is invalid, maybe this FD has been already closed by writer in some cases
    +				 * we need to break here to avoid endless loop and continue to run CHECK_FOR_INTERRUPTS.
    +				 */
    +				break;
    --- End diff --
    
    Normally EAGAIN should not happen (I did not see this on Linux manage however saw this on mac manage), but if it happens yes we should continue just as EINTR.
    
    For ENOMEM: It is debatable to quit myself or just keep trying and then risk letting os kill one or more processes. For me I'd quit the query to risk the kill of any process. For this err code, continue or error out, up to you.
    
    EINVAL and even possible other errno code ( that is a bug in either manage or kernel), we should not trust or that depend on os because the cost could be high (hang), given we do not have retry time limit here (By the way we should really have such mechanism for all hang-possible while) .



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq pull request #1141: HAWQ-1342. Fixed QE process hang in share...

Posted by liming01 <gi...@git.apache.org>.

Github user liming01 closed the pull request at:

    https://github.com/apache/incubator-hawq/pull/1141


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq pull request #1141: HAWQ-1342. Fixed QE process hang in share...

Posted by ictmalili <gi...@git.apache.org>.

Github user ictmalili commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/1141#discussion_r102650337
  
    --- Diff: src/backend/executor/nodeShareInputScan.c ---
    @@ -925,9 +923,12 @@ writer_wait_for_acks(ShareInput_Lk_Context *pctxt, int share_id, int xslice)
     			int save_errno = errno;
     			elog(LOG, "SISC WRITER (shareid=%d, slice=%d): notify still wait for an answer, errno %d",
     					share_id, currentSliceId, save_errno);
    -			/*if error(except EINTR) happens in select, we just return to avoid endless loop*/
    -			if(errno != EINTR){
    -				return;
    +			if(save_errno == EBADF)
    +			{
    +				/* The file description is invalid, maybe this FD has been already closed by writer in some cases
    +				 * we need to break here to avoid endless loop and continue to run CHECK_FOR_INTERRUPTS.
    +				 */
    +				break;
    --- End diff --
    
    Thanks @liming01.  Good explanation. 
    
    Do we need log the error information out before "break".  For example, adding log to indicate that the fd is no longer valid.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq pull request #1141: HAWQ-1342. Fixed QE process hang in share...

Posted by paul-guo- <gi...@git.apache.org>.

Github user paul-guo- commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/1141#discussion_r102647614
  
    --- Diff: src/backend/executor/nodeShareInputScan.c ---
    @@ -925,9 +923,12 @@ writer_wait_for_acks(ShareInput_Lk_Context *pctxt, int share_id, int xslice)
     			int save_errno = errno;
     			elog(LOG, "SISC WRITER (shareid=%d, slice=%d): notify still wait for an answer, errno %d",
     					share_id, currentSliceId, save_errno);
    -			/*if error(except EINTR) happens in select, we just return to avoid endless loop*/
    -			if(errno != EINTR){
    -				return;
    +			if(save_errno == EBADF)
    +			{
    +				/* The file description is invalid, maybe this FD has been already closed by writer in some cases
    +				 * we need to break here to avoid endless loop and continue to run CHECK_FOR_INTERRUPTS.
    +				 */
    +				break;
    --- End diff --
    
    The code change looks fine. Just I think for errno other than EINTR, we should log(ERROR, ....)
    
    Same comment for similar changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq issue #1141: HAWQ-1342. Fixed QE process hang in shared input...

Posted by ictmalili <gi...@git.apache.org>.

Github user ictmalili commented on the issue:

    https://github.com/apache/incubator-hawq/pull/1141
  
    LGTM. +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq pull request #1141: HAWQ-1342. Fixed QE process hang in share...

Posted by liming01 <gi...@git.apache.org>.

Github user liming01 commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/1141#discussion_r102663208
  
    --- Diff: src/backend/executor/nodeShareInputScan.c ---
    @@ -925,9 +923,12 @@ writer_wait_for_acks(ShareInput_Lk_Context *pctxt, int share_id, int xslice)
     			int save_errno = errno;
     			elog(LOG, "SISC WRITER (shareid=%d, slice=%d): notify still wait for an answer, errno %d",
     					share_id, currentSliceId, save_errno);
    -			/*if error(except EINTR) happens in select, we just return to avoid endless loop*/
    -			if(errno != EINTR){
    -				return;
    +			if(save_errno == EBADF)
    +			{
    +				/* The file description is invalid, maybe this FD has been already closed by writer in some cases
    +				 * we need to break here to avoid endless loop and continue to run CHECK_FOR_INTERRUPTS.
    +				 */
    +				break;
    --- End diff --
    
    @ictmalili ,  already printed log there, so need to to add code.
    @paul-guo- I just process the error code as doc described, if there are some bug is OS, then we should check log when the hung problem occur again. At that time, we can find out the root cause. We just need to make sure not to hung by hawq code bug here. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq pull request #1141: HAWQ-1342. Fixed QE process hang in share...

Posted by liming01 <gi...@git.apache.org>.

Github user liming01 commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/1141#discussion_r102649930
  
    --- Diff: src/backend/executor/nodeShareInputScan.c ---
    @@ -925,9 +923,12 @@ writer_wait_for_acks(ShareInput_Lk_Context *pctxt, int share_id, int xslice)
     			int save_errno = errno;
     			elog(LOG, "SISC WRITER (shareid=%d, slice=%d): notify still wait for an answer, errno %d",
     					share_id, currentSliceId, save_errno);
    -			/*if error(except EINTR) happens in select, we just return to avoid endless loop*/
    -			if(errno != EINTR){
    -				return;
    +			if(save_errno == EBADF)
    +			{
    +				/* The file description is invalid, maybe this FD has been already closed by writer in some cases
    +				 * we need to break here to avoid endless loop and continue to run CHECK_FOR_INTERRUPTS.
    +				 */
    +				break;
    --- End diff --
    
    select() ERRORS :
    EBADF  -- break;
    EINTR   -- loop
    EINVAL -- programming error, should not occurs
    
    on Linux:
    ENOMEM -- loop, wait for runaway to choose one transaction to rollback, or OS choose one process to kill
    
    On macos:
    EAGAIN -- loop
    
    EAGAIN -- loop
    
    So we just process the EBADF only, others are loop again or impossible to occurs. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq pull request #1141: HAWQ-1342. Fixed QE process hang in share...

Posted by ictmalili <gi...@git.apache.org>.

Github user ictmalili commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/1141#discussion_r102647741
  
    --- Diff: src/backend/executor/nodeShareInputScan.c ---
    @@ -925,9 +923,12 @@ writer_wait_for_acks(ShareInput_Lk_Context *pctxt, int share_id, int xslice)
     			int save_errno = errno;
     			elog(LOG, "SISC WRITER (shareid=%d, slice=%d): notify still wait for an answer, errno %d",
     					share_id, currentSliceId, save_errno);
    -			/*if error(except EINTR) happens in select, we just return to avoid endless loop*/
    -			if(errno != EINTR){
    -				return;
    +			if(save_errno == EBADF)
    +			{
    +				/* The file description is invalid, maybe this FD has been already closed by writer in some cases
    +				 * we need to break here to avoid endless loop and continue to run CHECK_FOR_INTERRUPTS.
    +				 */
    +				break;
    --- End diff --
    
    Agree! Can we log out the information?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---