Posted to dev@drill.apache.org by Ben-Zvi <gi...@git.apache.org> on 2016/09/10 01:10:22 UTC

[GitHub] drill pull request #585: DRILL-3898 : Sort spill was modified to catch all e...

GitHub user Ben-Zvi opened a pull request:

    https://github.com/apache/drill/pull/585

    DRILL-3898 : Sort spill was modified to catch all errors, ignore repeated errors while closing the new group and issue a more detailed error message.
    
    It turns out that the spilling I/O can run into various kinds of errors (no space left, failure to create a file, etc.), which are thrown as different exception classes. Hence the catch() statement was changed to catch the more general Throwable, and the exception's message is now added for more detail (e.g., no disk space).
    
    Before the change, the "no disk space" Throwable was not caught, and thus execution continued.
    
    Closing the newGroup can also hit I/O errors (e.g., while flushing), so a try/catch was added to ignore those repeated failures.
    
    Note that this change should also fix DRILL-4542 ("if external sort fails to spill to disk, memory is leaked and wrong error message is displayed").
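    
    As a reference, the overall shape of the change is sketched below. SpillSketch and FakeSpillFile are illustrative stand-ins (not Drill classes) and the real code wraps the error differently, but the pattern (catch Throwable, attempt cleanup while ignoring a repeated failure, and surface the underlying message) is the same:
    
        import java.io.IOException;
    
        class SpillSketch {
          // Illustrative stand-in for the spill output; both write() and close()
          // fail the same way once the disk is full.
          static class FakeSpillFile implements AutoCloseable {
            void write() throws IOException {
              throw new Error("No space left on device");   // Hadoop wraps such failures in an Error
            }
            @Override
            public void close() {
              throw new Error("No space left on device");   // cleanup hits the same problem
            }
          }
    
          public static void main(String[] args) {
            FakeSpillFile spill = new FakeSpillFile();
            try {
              spill.write();
            } catch (Throwable e) {                          // Throwable, so Errors are caught as well
              try {
                spill.close();                               // best-effort cleanup of the partial spill file
              } catch (Throwable t) { /* repeated failure while closing; just ignore */ }
              System.out.println(
                  "External Sort encountered an error while spilling to disk: " + e.getMessage());
            }
          }
        }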

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Ben-Zvi/drill DRILL-3898

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/585.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #585
    
----
commit e988f1644be1d9fde24a489d94c7dbc54f8e82d8
Author: Boaz Ben-Zvi <bo...@mapr.com>
Date:   2016-09-09T23:36:03Z

    DRILL-3898 :  Sort spill was modified to catch all errors, ignore repeated errors while closing the new group and issue a more detailed error message.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #585: DRILL-3898 : Sort spill was modified to catch all e...

Posted by Ben-Zvi <gi...@git.apache.org>.
Github user Ben-Zvi closed the pull request at:

    https://github.com/apache/drill/pull/585



[GitHub] drill pull request #585: DRILL-3898 : Sort spill was modified to catch all e...

Posted by Ben-Zvi <gi...@git.apache.org>.
Github user Ben-Zvi commented on a diff in the pull request:

    https://github.com/apache/drill/pull/585#discussion_r79255636
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/ExternalSortBatch.java ---
    @@ -592,11 +592,14 @@ public BatchGroup mergeAndSpill(LinkedList<BatchGroup> batchGroups) throws Schem
           }
           injector.injectChecked(context.getExecutionControls(), INTERRUPTION_WHILE_SPILLING, IOException.class);
           newGroup.closeOutputStream();
    -    } catch (Exception e) {
    +    } catch (Throwable e) {
           // we only need to cleanup newGroup if spill failed
    -      AutoCloseables.close(e, newGroup);
    +      try {
    +        AutoCloseables.close(e, newGroup);
    +      } catch (Throwable t) { /* close() may hit the same IO issue; just ignore */ }
    --- End diff --
    
    In the case of no disk space to spill, close() tries to clean up by calling flushBuffer(), which eventually throws the same exception since there is still no space:
    
    at java.io.FileOutputStream.write(FileOutputStream.java:326)
    	  at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:246)
    	  at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    	  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    	  - locked <0x24e5> (a java.io.BufferedOutputStream)
    	  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
    	  at java.io.DataOutputStream.write(DataOutputStream.java:107)
    	  - locked <0x24e7> (a org.apache.hadoop.fs.FSDataOutputStream)
    	  at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:419)
    	  at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:206)
    	  at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:163)
    	  - locked <0x24e8> (a org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer)
    	  at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:144)
    	  at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:407)
    	  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    	  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
    	  at org.apache.drill.exec.physical.impl.xsort.BatchGroup.close(BatchGroup.java:169)
    	  at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:76)
    	  at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:53)
    	  at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:43)
    	  at org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.mergeAndSpill(ExternalSortBatch.java:598)




[GitHub] drill issue #585: DRILL-3898 : Sort spill was modified to catch all errors, ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/585
  
    LGTM



[GitHub] drill issue #585: DRILL-3898 : Sort spill was modified to catch all errors, ...

Posted by Ben-Zvi <gi...@git.apache.org>.
Github user Ben-Zvi commented on the issue:

    https://github.com/apache/drill/pull/585
  
    Below are results from testing; the first run was with not enough disk space, and the second with a missing spill storage directory:
    
    0: jdbc:drill:zk=local> create table store_sales_20(ss_item_sk, ss_customer_sk, ss_cdemo_sk, ss_hdemo_sk, s_sold_date_sk, ss_promo_sk) partition by (ss_promo_sk) as
    . . . . . . . . . . . >  select
    . . . . . . . . . . . >    case when columns[2] = '' then cast(null as varchar(100)) else cast(columns[2] as varchar(100)) end,
    . . . . . . . . . . . >    case when columns[3] = '' then cast(null as varchar(100)) else cast(columns[3] as varchar(100)) end,
    . . . . . . . . . . . >    case when columns[4] = '' then cast(null as varchar(100)) else cast(columns[4] as varchar(100)) end, 
    . . . . . . . . . . . >    case when columns[5] = '' then cast(null as varchar(100)) else cast(columns[5] as varchar(100)) end, 
    . . . . . . . . . . . >    case when columns[0] = '' then cast(null as varchar(100)) else cast(columns[0] as varchar(100)) end, 
    . . . . . . . . . . . >    case when columns[8] = '' then cast(null as varchar(100)) else cast(columns[8] as varchar(100)) end
    . . . . . . . . . . . > FROM dfs.`/Users/boazben-zvi/data/store_sales/store_sales.dat`;
    Error: RESOURCE ERROR: External Sort encountered an error while spilling to disk
    
    java.io.IOException: No space left on device
    Fragment 0:0
    
    [Error Id: 35d13ef6-f88a-4a80-9f5e-ddb15efc9d92 on 10.250.57.63:31010] (state=,code=0)
    0: jdbc:drill:zk=local> create table store_sales_20(ss_item_sk, ss_customer_sk, ss_cdemo_sk, ss_hdemo_sk, s_sold_date_sk, ss_promo_sk) partition by (ss_promo_sk) as
    . . . . . . . . . . . >  select
    . . . . . . . . . . . >    case when columns[2] = '' then cast(null as varchar(100)) else cast(columns[2] as varchar(100)) end,
    . . . . . . . . . . . >    case when columns[3] = '' then cast(null as varchar(100)) else cast(columns[3] as varchar(100)) end,
    . . . . . . . . . . . >    case when columns[4] = '' then cast(null as varchar(100)) else cast(columns[4] as varchar(100)) end, 
    . . . . . . . . . . . >    case when columns[5] = '' then cast(null as varchar(100)) else cast(columns[5] as varchar(100)) end, 
    . . . . . . . . . . . >    case when columns[0] = '' then cast(null as varchar(100)) else cast(columns[0] as varchar(100)) end, 
    . . . . . . . . . . . >    case when columns[8] = '' then cast(null as varchar(100)) else cast(columns[8] as varchar(100)) end
    . . . . . . . . . . . > FROM dfs.`/Users/boazben-zvi/data/store_sales/store_sales.dat`;
    Error: RESOURCE ERROR: External Sort encountered an error while spilling to disk
    
    Mkdirs failed to create /tmp/drill/spill/282cbdbc-630a-2218-3871-165491f5e96c_majorfragment0_minorfragment0_operator6 (exists=false, cwd=file:/Users/boazben-zvi/IdeaProjects/drill)
    Fragment 0:0
    
    [Error Id: dea8b3fd-9661-48b5-9a3c-11d2dadf8f07 on 10.250.57.63:31010] (state=,code=0)




[GitHub] drill pull request #585: DRILL-3898 : Sort spill was modified to catch all e...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/585#discussion_r79096107
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/ExternalSortBatch.java ---
    @@ -592,11 +592,14 @@ public BatchGroup mergeAndSpill(LinkedList<BatchGroup> batchGroups) throws Schem
           }
           injector.injectChecked(context.getExecutionControls(), INTERRUPTION_WHILE_SPILLING, IOException.class);
           newGroup.closeOutputStream();
    -    } catch (Exception e) {
    +    } catch (Throwable e) {
           // we only need to cleanup newGroup if spill failed
    -      AutoCloseables.close(e, newGroup);
    +      try {
    +        AutoCloseables.close(e, newGroup);
    +      } catch (Throwable t) { /* close() may hit the same IO issue; just ignore */ }
    --- End diff --
    
    It looks like close(Throwable t, AutoCloseable) suppresses the exception; did you get an exception during testing? Otherwise, you could remove this second try-catch.
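    
    For illustration only (this is not Drill's actual AutoCloseables implementation): a helper that attaches close failures as suppressed exceptions typically catches Exception, so an Error thrown by close() can still escape it, which is what the extra try-catch guards against:
    
        import java.io.IOException;
    
        class SuppressSketch {
          // Hypothetical helper in the spirit of close(Throwable, AutoCloseable):
          // close the resource and attach any Exception as suppressed.
          static void close(Throwable primary, AutoCloseable c) {
            try {
              c.close();
            } catch (Exception e) {
              primary.addSuppressed(e);       // an Error would not be caught here
            }
          }
    
          public static void main(String[] args) {
            Throwable primary = new IOException("No space left on device");
            AutoCloseable failsWithError = () -> { throw new Error("native fs error on close"); };
            try {
              close(primary, failsWithError); // the Error escapes the helper...
            } catch (Throwable t) {
              System.out.println("second failure ignored: " + t.getMessage());
            }
            System.out.println("original error kept: " + primary.getMessage());
          }
        }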



[GitHub] drill pull request #585: DRILL-3898 : Sort spill was modified to catch all e...

Posted by Ben-Zvi <gi...@git.apache.org>.
Github user Ben-Zvi commented on a diff in the pull request:

    https://github.com/apache/drill/pull/585#discussion_r79267883
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/ExternalSortBatch.java ---
    @@ -592,11 +592,14 @@ public BatchGroup mergeAndSpill(LinkedList<BatchGroup> batchGroups) throws Schem
           }
           injector.injectChecked(context.getExecutionControls(), INTERRUPTION_WHILE_SPILLING, IOException.class);
           newGroup.closeOutputStream();
    -    } catch (Exception e) {
    +    } catch (Throwable e) {
           // we only need to cleanup newGroup if spill failed
    -      AutoCloseables.close(e, newGroup);
    +      try {
    +        AutoCloseables.close(e, newGroup);
    +      } catch (Throwable t) { /* close() may hit the same IO issue; just ignore */ }
    --- End diff --
    
    The root cause for the whole bug is in Hadoop's RawLocalFileSystem.java:
    
    package org.apache.hadoop.fs;
    .....
        public void write(byte[] b, int off, int len) throws IOException {
          try {
            fos.write(b, off, len);
          } catch (IOException e) {                // unexpected exception
            throw new FSError(e);                  // assume native fs error
          }
        }
        
    And FSError is not a subclass of IOException!
    
    java.lang.Object
        java.lang.Throwable
            java.lang.Error
                org.apache.hadoop.fs.FSError
    
    So the only common ancestor is Throwable. Any part of the Drill code that catches only IOException will not catch it.
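    
    A small self-contained illustration of that hierarchy point (NativeFsError below is a hypothetical stand-in for org.apache.hadoop.fs.FSError, just to keep the example compilable):
    
        import java.io.IOException;
    
        class CatchHierarchySketch {
          // Stand-in mirroring FSError's hierarchy: it extends Error, not IOException.
          static class NativeFsError extends Error {
            NativeFsError(String msg) { super(msg); }
          }
    
          // Declared to throw IOException (like the write path), but what actually
          // comes out is an Error.
          static void write() throws IOException {
            throw new NativeFsError("No space left on device");
          }
    
          public static void main(String[] args) {
            try {
              write();
            } catch (IOException e) {
              System.out.println("never reached: an Error is not an IOException");
            } catch (Throwable t) {
              System.out.println("caught via Throwable: " + t.getMessage());
            }
          }
        }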
    
    


