Posted to dev@pig.apache.org by "Haitao Yao (JIRA)" <ji...@apache.org> on 2012/07/11 11:42:34 UTC

[jira] [Created] (PIG-2812) Spill InternalCachedBag into only 1 file

Haitao Yao created PIG-2812:
-------------------------------

             Summary: Spill InternalCachedBag into only 1 file
                 Key: PIG-2812
                 URL: https://issues.apache.org/jira/browse/PIG-2812
             Project: Pig
          Issue Type: Bug
          Components: data
            Reporter: Haitao Yao


I encountered a reducer OOM caused by java.io.DeleteOnExitHook. I found that InternalCachedBag creates separate tmp files, and each tmp file is registered for deletion on exit, so the file delete hook caused the OOM.
Why not just hold the tmp file handle and spill into only one tmp file?
Too many tmp files may also block the tasktracker start process if the tmp files are not cleaned up in time and the tasktracker restarts at that moment.
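
A minimal sketch of the "hold the tmp file handle and spill only one tmp file" idea, assuming records can be written as strings. SingleFileSpiller and its methods are hypothetical names for illustration, not Pig classes, and this is not the code from any patch on this issue.

```java
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig: appends every spill to one temp file.
class SingleFileSpiller {
    private final File spillFile;
    private final DataOutputStream out;
    private long recordCount = 0;

    SingleFileSpiller() throws IOException {
        // One temp file for the lifetime of the bag; deleteOnExit() is never called.
        spillFile = File.createTempFile("pigbag", ".spill");
        out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(spillFile)));
    }

    // Append one batch of in-memory records to the same file.
    void spill(List<String> records) throws IOException {
        for (String r : records) {
            out.writeUTF(r);
            recordCount++;
        }
        out.flush();
    }

    // Read every spilled record back, in the order it was written.
    List<String> readAll() throws IOException {
        List<String> result = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new FileInputStream(spillFile))) {
            for (long i = 0; i < recordCount; i++) {
                result.add(in.readUTF());
            }
        }
        return result;
    }

    // Close the held handle and delete the file explicitly.
    boolean clear() throws IOException {
        out.close();
        return spillFile.delete();
    }

    File file() {
        return spillFile;
    }
}
```

With this shape, the number of tmp files per bag stays at one no matter how many times the bag spills, and nothing is registered with the delete-on-exit hook.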


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2812) Spill InternalCachedBag into only 1 file

Posted by "Haitao Yao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412459#comment-13412459 ] 

Haitao Yao commented on PIG-2812:
---------------------------------

Oh God, this is really a big change, because it affects the Iterator of the data bags.
If we use 1 spill file, then every time next() is called we have to skip all the bytes already read, which would be a big performance penalty. I think we could change the iterator interface by adding a free() method, so that the iterator can hold an InputStream, and make sure free() is called when the iterator finishes its job.
Without the free() method, we cannot handle the case where somebody calls break inside a loop over the iterator. This would leak the InputStream.
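
The free() idea above can be sketched roughly as follows. SpillIterator is a hypothetical class, not Pig's iterator API; here close() plays the role of the proposed free(), and the records are assumed to be plain ints for brevity. The sketch also assumes Java 7 try-with-resources is available; on Java 6 a finally block would do the same job.

```java
import java.io.Closeable;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical iterator, not Pig's API: holds one InputStream over the spill
// file and releases it in close(), which stands in for the proposed free().
class SpillIterator implements Iterator<Integer>, Closeable {
    private final DataInputStream in;
    private Integer next;
    private boolean closed = false;

    SpillIterator(File spillFile) throws IOException {
        in = new DataInputStream(new FileInputStream(spillFile));
        advance();
    }

    // Read the next record. The open stream keeps its position, so bytes that
    // were already read are not skipped again on each call.
    private void advance() throws IOException {
        try {
            next = in.readInt();
        } catch (EOFException e) {
            next = null; // end of the spill file
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public Integer next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        Integer current = next;
        try {
            advance();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return current;
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }

    @Override
    public void close() throws IOException {
        closed = true;
        in.close();
    }

    boolean isClosed() {
        return closed;
    }
}
```

A caller that wraps the loop in try (SpillIterator it = new SpillIterator(file)) { ... } gets the stream closed even when the loop exits early with break, which is the leak described above.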

                

        

[jira] [Updated] (PIG-2812) Spill InternalCachedBag into only 1 file

Posted by "Haitao Yao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haitao Yao updated PIG-2812:
----------------------------

    Attachment: spill.patch

Patch for the spill logic: spill into only 1 directory and use a shutdown hook to delete the directory.
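
A rough sketch of the "one directory, one shutdown hook" approach, for illustration only. SpillDir is a hypothetical class and is not the code in spill.patch; it assumes java.nio.file.Files (Java 7) is available.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical helper, not the code in spill.patch.
class SpillDir {
    private final File dir;

    SpillDir() throws IOException {
        final File d = Files.createTempDirectory("pigspill").toFile();
        this.dir = d;
        // One hook for the whole directory instead of one deleteOnExit()
        // entry per spill file.
        Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
            @Override
            public void run() {
                deleteRecursively(d);
            }
        }));
    }

    // Every spill file is created inside the single spill directory.
    File newSpillFile() throws IOException {
        return File.createTempFile("bag", ".spill", dir);
    }

    // Explicit cleanup for the normal path; the hook only covers leaks.
    void cleanup() {
        deleteRecursively(dir);
    }

    File dir() {
        return dir;
    }

    static void deleteRecursively(File f) {
        File[] children = f.listFiles(); // null for regular files or missing paths
        if (children != null) {
            for (File c : children) {
                deleteRecursively(c);
            }
        }
        f.delete();
    }
}
```

The hook holds a reference to one directory only, so its memory cost does not grow with the number of spill files.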

                

        

[jira] [Commented] (PIG-2812) Spill InternalCachedBag into only 1 file

Posted by "Haitao Yao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432882#comment-13432882 ] 

Haitao Yao commented on PIG-2812:
---------------------------------

@Alan Gates
They are cleared in the clear method, but in case some leak happens, deleteOnExit is still required. Even if the file is deleted, the file path is still stored in java.io.DeleteOnExitHook, so you may still get an OOM because of this.
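
One way to keep a leak fallback without the retained paths described above is to track spill files in application code, where an entry can be dropped as soon as its file is deleted. The sketch below is an assumption-laden illustration: SpillFileTracker is a hypothetical class, not part of Pig, and releaseAll() is assumed to be called from a single shutdown hook.

```java
import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical tracker, not part of Pig: remembers spill files itself so that
// a deleted file's entry can be forgotten, unlike a File.deleteOnExit() entry.
class SpillFileTracker {
    private final Set<File> live = new HashSet<File>();

    synchronized File create() throws IOException {
        File f = File.createTempFile("pigbag", ".spill");
        live.add(f); // tracked here instead of calling f.deleteOnExit()
        return f;
    }

    // Delete one file and drop its entry so the set cannot grow without bound.
    synchronized boolean release(File f) {
        live.remove(f);
        return !f.exists() || f.delete();
    }

    // Fallback for leaked files, e.g. invoked from a single shutdown hook.
    synchronized void releaseAll() {
        for (File f : live) {
            f.delete();
        }
        live.clear();
    }

    synchronized int liveCount() {
        return live.size();
    }
}
```

In this sketch, the clear method of a bag would call release() for its files, so in a long-lived or reused JVM the tracked set only contains files that are still on disk.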

                

        

[jira] [Updated] (PIG-2812) Spill InternalCachedBag into only 1 file

Posted by "Haitao Yao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haitao Yao updated PIG-2812:
----------------------------

    Patch Info: Patch Available
    

[jira] [Assigned] (PIG-2812) Spill InternalCachedBag into only 1 file

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-2812:
-----------------------------------

    Assignee: Haitao Yao
    

[jira] [Commented] (PIG-2812) Spill InternalCachedBag into only 1 file

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413301#comment-13413301 ] 

Alan Gates commented on PIG-2812:
---------------------------------

Why not just change the {{clear}} method to delete the temp file?
                

        

[jira] [Commented] (PIG-2812) Spill InternalCachedBag into only 1 file

Posted by "Haitao Yao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419768#comment-13419768 ] 

Haitao Yao commented on PIG-2812:
---------------------------------

Well, the spilled files should have been cleared under normal conditions. But if your job fails and the Tasktracker reuses the child Java process, the OOM will happen.
I really think spilling to one file is better.

                

        

[jira] [Updated] (PIG-2812) Spill InternalCachedBag into only 1 file

Posted by "Haitao Yao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haitao Yao updated PIG-2812:
----------------------------

    Attachment: aa.jpg

Heap dump analysis of the reducer that hit the OOM.

                

        

[jira] [Commented] (PIG-2812) Spill InternalCachedBag into only 1 file

Posted by "Haitao Yao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412421#comment-13412421 ] 

Haitao Yao commented on PIG-2812:
---------------------------------

Shall we change the parent class, org.apache.pig.data.DefaultAbstractBag, or just modify InternalCachedBag?
I don't know why DefaultAbstractBag writes a tmp file for every tuple.
In our own code base, I just modified InternalCachedBag, because I don't know how the other subclasses of DefaultAbstractBag will behave if I modify the whole spill logic.
I want to contribute to this.
Thanks.

                

        

[jira] [Updated] (PIG-2812) Spill InternalCachedBag into only 1 file

Posted by "Julien Le Dem (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem updated PIG-2812:
-------------------------------

    Fix Version/s:     (was: 0.11)

I'm detaching this from pig-0.11 as it is not ready yet.
                

[jira] [Updated] (PIG-2812) Spill InternalCachedBag into only 1 file

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2812:
----------------------------

    Fix Version/s: 0.11
    

        

[jira] [Commented] (PIG-2812) Spill InternalCachedBag into only 1 file

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486634#comment-13486634 ] 

Olga Natkovich commented on PIG-2812:
-------------------------------------

Alan - are you planning to review this one? Do we need to include this in 0.11?
                