Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2008/04/01 02:38:24 UTC

[jira] Created: (PIG-176) pig creates many small files when it spills

pig creates many small files when it spills
-------------------------------------------

                 Key: PIG-176
                 URL: https://issues.apache.org/jira/browse/PIG-176
             Project: Pig
          Issue Type: Bug
            Reporter: Olga Natkovich
            Assignee: Alan Gates


Currently, when Pig spills, it can generate millions of small (under 128K) files. This is partly due to PIG-170, but even with that patch Pig can still try to spill small bags.

The proposal is to not spill small files. Alan told me that the logic is already there and we just need to bump the size limit.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-176) pig creates many small files when it spills

Posted by "Pi Song (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586385#action_12586385 ] 

Pi Song commented on PIG-176:
-----------------------------

Now that we spill big bags first, my observation is that there are still cases where a big container bag is spilled, so its mContent becomes empty, but most of its inner bags' WeakReferences haven't been cleaned up by the GC yet. In such cases, if we haven't freed up enough memory, those inner bags will be spilled unnecessarily (even though all their contents were already spilled in the big bag's spill). There are possibly two simple ways to solve this:

1) In SpillableMemoryManager, we try putting Thread.yield() between spills. This should give the GC more time to clean up without degrading performance too much. However, if the main execution thread doesn't produce any bags (e.g. a map task where all keys and values are tuples and atomic data), this will instead give the main execution thread more time to use up memory more quickly.

2) Check the size of the spillable currently being spilled. If it is larger than a constant X, call System.gc(). This is safer than (1), but because we explicitly call the GC more often, it may have some impact on performance. However, since spilling small files is much slower than calling System.gc(), this approach should generally give better performance overall.

I don't really have a processing task that incurs spilling that much. Can anyone please try (2) out?
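
A rough sketch of option (2), for illustration only. The names below (SpillableSketch, FixedSizeBag, GcAfterBigSpill, gcActivationSize) are hypothetical stand-ins and not Pig's actual classes; only WeakReference and System.gc() are real JDK APIs. It assumes the memory manager holds its spillables through weak references, as the comment above implies, and that System.gc() is only a hint to the JVM.

```java
import java.lang.ref.WeakReference;
import java.util.List;

// Hypothetical stand-in for a spillable bag; not Pig's real interface.
interface SpillableSketch {
    long getMemorySize(); // estimated bytes currently held in memory
    long spill();         // pretend to write to disk; returns bytes freed
}

// Hypothetical bag with a fixed size, used only to exercise the sketch.
class FixedSizeBag implements SpillableSketch {
    private long size;
    FixedSizeBag(long size) { this.size = size; }
    public long getMemorySize() { return size; }
    public long spill() { long freed = size; size = 0; return freed; }
}

class GcAfterBigSpill {
    final long gcActivationSize; // the "constant X" from option (2)
    int gcCalls = 0;             // counted so the behaviour can be observed

    GcAfterBigSpill(long gcActivationSize) {
        this.gcActivationSize = gcActivationSize;
    }

    /** Spills candidates in order until bytesToFree is reclaimed; returns how many were spilled. */
    int spillUntil(List<WeakReference<SpillableSketch>> candidates, long bytesToFree) {
        long freed = 0;
        int spilled = 0;
        for (WeakReference<SpillableSketch> ref : candidates) {
            if (freed >= bytesToFree) break;
            SpillableSketch s = ref.get();
            if (s == null) continue;  // referent already collected, nothing to spill
            long size = s.getMemorySize();
            if (size == 0) continue;  // already emptied
            freed += s.spill();
            spilled++;
            if (size > gcActivationSize) {
                // After a big spill, ask the collector to run so that inner bags
                // that are now garbage can be cleared before the loop reaches them.
                System.gc();
                gcCalls++;
            }
        }
        return spilled;
    }
}
```

With the activation size left at a very large value, the loop never requests a collection and behaves like a plain spill loop.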


[jira] Commented: (PIG-176) pig creates many small files when it spills

Posted by "Pi Song (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590238#action_12590238 ] 

Pi Song commented on PIG-176:
-----------------------------

OK, will do that.


[jira] Commented: (PIG-176) pig creates many small files when it spills

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12585163#action_12585163 ] 

Olga Natkovich commented on PIG-176:
------------------------------------

Pi,

Running faster is part of it. The other part is not to fill up disks with tiny files, which causes disk fragmentation and also takes forever to clean up at the end of processing, though your suggestion of cleaning as we go might help with that a bit.


[jira] Updated: (PIG-176) pig creates many small files when it spills

Posted by "Pi Song (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pi Song updated PIG-176:
------------------------

    Attachment: pig176_v2.patch

Updated to the latest trunk and made use of the new configuration structure.


[jira] Updated: (PIG-176) pig creates many small files when it spills

Posted by "Pi Song (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pi Song updated PIG-176:
------------------------

    Attachment: pig_176_smallbags_v1.patch

This patch implements (1) a spill file size threshold and (2) my idea from the last comment.

"spill.size.threshold" and "spill.gc.activation.size" are to be set as JVM parameters or in .pigrc in order to use this new feature. The default values are 0 and Long.MAX_VALUE respectively.

There is a bit of a problem with (1): Bag.getMemorySize() sometimes doesn't return an accurate value, so even when the threshold is set, it's still possible for files smaller than the threshold to be created.

The configuration code is still messy in MapReduceLauncher. This needs a clean-up after the configuration patch gets in.
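
As an illustration of the two settings named above, here is a minimal sketch of reading them from JVM system properties with the stated defaults. The SpillSettings class is hypothetical and is not the code in the patch; the property names and defaults are the ones given in this comment, and Long.getLong is the standard JDK lookup for numeric system properties.

```java
// Hypothetical holder for the two settings; not the class used in the patch.
class SpillSettings {
    static final String SIZE_THRESHOLD_KEY = "spill.size.threshold";
    static final String GC_ACTIVATION_KEY = "spill.gc.activation.size";

    final long sizeThreshold;    // bags estimated below this size are not spilled
    final long gcActivationSize; // spills larger than this trigger an explicit GC request

    SpillSettings(long sizeThreshold, long gcActivationSize) {
        this.sizeThreshold = sizeThreshold;
        this.gcActivationSize = gcActivationSize;
    }

    /** Reads both values from system properties, falling back to the defaults 0 and Long.MAX_VALUE. */
    static SpillSettings fromSystemProperties() {
        return new SpillSettings(
                Long.getLong(SIZE_THRESHOLD_KEY, 0L),
                Long.getLong(GC_ACTIVATION_KEY, Long.MAX_VALUE));
    }
}
```

With these defaults both features are effectively off: a threshold of 0 excludes nothing, and no spill can exceed Long.MAX_VALUE. Passing something like -Dspill.size.threshold=131072 on the JVM command line would be one way to set the first value; the exact launch command for Pig is not shown in this thread.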


[jira] Resolved: (PIG-176) pig creates many small files when it spills

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates resolved PIG-176.
----------------------------

       Resolution: Fixed
    Fix Version/s: 0.1.0

Patch checked in at revision 652906.


[jira] Commented: (PIG-176) pig creates many small files when it spills

Posted by "Pi Song (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12585069#action_12585069 ] 

Pi Song commented on PIG-176:
-----------------------------

So let's say that if the size is smaller than some threshold, we don't spill, right? This is very easy to fix, but we will reclaim a bit less memory than before, causing some tasks to fail more often in exchange for some tasks running faster. Is this acceptable?

Probably the best way to go is to make it configurable, but PIG-111 isn't in yet. Sighhh..... I want to have more time.
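
To make the trade-off concrete, a small sketch that counts how many spill files would be written and how many bytes would be reclaimed for a given threshold. The SmallBagFilter class and the example sizes are made up for illustration; the only rule taken from the discussion is "if the size is smaller than the threshold, don't spill".

```java
// Hypothetical helper; the sizes passed in are estimated in-memory bag sizes in bytes.
class SmallBagFilter {
    /** Number of spill files written: one per bag at or above the threshold. */
    static int filesCreated(long[] bagSizes, long threshold) {
        int files = 0;
        for (long size : bagSizes) {
            if (size >= threshold) files++;
        }
        return files;
    }

    /** Bytes reclaimed: bags below the threshold stay in memory and free nothing. */
    static long bytesReclaimed(long[] bagSizes, long threshold) {
        long total = 0;
        for (long size : bagSizes) {
            if (size >= threshold) total += size;
        }
        return total;
    }
}
```

For example, with bags of 64K, 200K, 1M and 4K and a 128K threshold, only the 200K and 1M bags are spilled: two files instead of four, at the cost of leaving 68K in memory.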


[jira] Commented: (PIG-176) pig creates many small files when it spills

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590074#action_12590074 ] 

Alan Gates commented on PIG-176:
--------------------------------

Pi,

Did you want to rework this patch now that PIG-111 is in and you can read the properties from Pig's Properties object rather than System.getProperties()?
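
A minimal sketch of what reading from a passed-in Properties object, rather than the global system properties, could look like. SpillProps and its getLong helper are hypothetical; how Pig actually exposes its Properties object is not shown in this thread, so the Properties instance here is assumed to be supplied by the caller.

```java
import java.util.Properties;

// Hypothetical helper for numeric settings held in a java.util.Properties.
class SpillProps {
    /** Returns the value for key as a long, or defaultValue if it is missing or not a number. */
    static long getLong(Properties props, String key, long defaultValue) {
        String raw = props.getProperty(key);
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue; // fall back rather than fail on a malformed value
        }
    }
}
```

The caller would then do something like SpillProps.getLong(props, "spill.size.threshold", 0L), which keeps the setting testable without touching JVM-wide state.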
