
Posted to dev@giraph.apache.org by "Eli Reisman (JIRA)" <ji...@apache.org> on 2012/08/14 20:17:38 UTC

[jira] [Created] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Eli Reisman created GIRAPH-301:
----------------------------------

Summary: InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.
Key: GIRAPH-301
URL: https://issues.apache.org/jira/browse/GIRAPH-301
Project: Giraph
Issue Type: Improvement
Components: bsp, graph, zookeeper
Affects Versions: 0.2.0
Reporter: Eli Reisman
Assignee: Eli Reisman
Fix For: 0.2.0
Attachments: GIRAPH-301-1.patch

With recent additions to the codebase, users here have noticed many workers are able to load input splits extremely quickly, and this has altered the behavior of Giraph during INPUT_SUPERSTEP when using the current algorithm for split reservations. A few workers process multiple splits (often overwhelming Netty and getting GC errors as they attempt to offload too much data too quickly) while many (often most) of the others just sleep through the superstep, never successfully participating at all.

Essentially, the current algo is:

1. scan input split list, skipping nodes that are marked "Finished"

2. grab the first unfinished node in the list (reserved or not) and check its reserved status.

3. if not reserved, attempt to reserve & return it if successful.

4. if the first one you check is already taken, sleep for way too long and only wake up if another worker finishes a split, then contend with that worker for another split, while the majority of the split list might sit idle, not actually checked or claimed by anyone yet.

This does not work. By making a few simple changes (and acknowledging that ZK reads are cheap, only writes are not) this patch is able to get every worker involved, and keep them in the game, ensuring that the INPUT_SUPERSTEP passes quickly and painlessly, and without overwhelming Netty by spreading the memory load the split readers bear more evenly. If the giraph.splitmb and -w options are set correctly, behavior is now exactly as one would expect it to be.

This also lets the INPUT_SUPERSTEP pass more quickly, and lets a job survive the INPUT_SUPERSTEP for a given data load on fewer Hadoop memory slots.
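
To make the clumping concrete, here is a minimal in-memory sketch of the two scan strategies described above. It is not the Giraph code and makes no ZooKeeper calls: the class name, method names, and the State enum are hypothetical, and workers claim splits one after another rather than concurrently.

```java
import java.util.Arrays;

/** In-memory stand-in for the split znodes; no real ZooKeeper calls are made. */
final class SplitScanSketch {
    enum State { FREE, RESERVED, FINISHED }

    /** Old behavior as described above: inspect only the first unfinished split. */
    static int reserveFirstUnfinishedOnly(State[] splits) {
        for (int i = 0; i < splits.length; i++) {
            if (splits[i] == State.FINISHED) {
                continue; // step 1: skip finished splits
            }
            if (splits[i] == State.FREE) {
                splits[i] = State.RESERVED; // step 3: claim it
                return i;
            }
            return -1; // step 4: first unfinished split is taken, caller sleeps
        }
        return -1; // everything is finished
    }

    /** Proposed behavior: keep scanning past reserved splits until a free one turns up. */
    static int reserveByContinuingScan(State[] splits) {
        for (int i = 0; i < splits.length; i++) {
            if (splits[i] == State.FREE) {
                splits[i] = State.RESERVED;
                return i;
            }
        }
        return -1; // nothing left to claim
    }

    /** Counts how many of the given workers get a split on their first attempt. */
    static int countSuccessfulClaims(int workers, int numSplits, boolean continueScan) {
        State[] splits = new State[numSplits];
        Arrays.fill(splits, State.FREE);
        int claimed = 0;
        for (int w = 0; w < workers; w++) {
            int idx = continueScan
                ? reserveByContinuingScan(splits)
                : reserveFirstUnfinishedOnly(splits);
            if (idx >= 0) {
                claimed++;
            }
        }
        return claimed;
    }
}
```

With four workers and four splits, only the first worker gets a split under the old scan in this model, while the continuing scan hands one split to every worker.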

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435348#comment-13435348 ] 

Eli Reisman commented on GIRAPH-301:
------------------------------------

One thing we can do to reduce ZK reads at the beginning is to check for RESERVED before FINISHED splits, since we will encounter many more of those on the critical (busy) first pass to claim nodes, and the loop can continue without doing the 2nd ZK read to check for FINISHED (there shouldn't be too many of those in the first pass). This will add reads for workers that awaken after the first pass and look for another split to claim, since by then most will be finished. But it avoids the mad rush. I might experiment with this change and see what effect it has.

Given the fact that we want to operate within pretty tight memory constraints, trying to tune for 1 worker per split when possible seems like the right move. Any other ideas? I do think one way or another the idea of workers going back to sleep the first time they attempt to claim a split is going to guarantee the clumping behavior if left as-is.
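
The read-count trade-off in this comment can be sketched roughly as follows. This is an illustration only: the class and enum are hypothetical, each flag check stands in for one ZK read, and the model assumes a finished split no longer shows up as reserved, which is just one way to read the comment and may not match the real znode layout.

```java
/** Counts simulated ZK reads for the two possible check orders. */
final class CheckOrderSketch {
    enum State { FREE, RESERVED, FINISHED }

    /** Check FINISHED first, then RESERVED; stop at the first free split. */
    static int readsFinishedFirst(State[] splits) {
        int reads = 0;
        for (State s : splits) {
            reads++; // simulated read of the FINISHED flag
            if (s == State.FINISHED) {
                continue;
            }
            reads++; // simulated read of the RESERVED flag
            if (s == State.RESERVED) {
                continue;
            }
            break; // free split found
        }
        return reads;
    }

    /** Check RESERVED first, then FINISHED; stop at the first free split. */
    static int readsReservedFirst(State[] splits) {
        int reads = 0;
        for (State s : splits) {
            reads++; // simulated read of the RESERVED flag
            if (s == State.RESERVED) {
                continue; // taken, so the FINISHED read is skipped
            }
            reads++; // simulated read of the FINISHED flag
            if (s == State.FINISHED) {
                continue;
            }
            break; // free split found
        }
        return reads;
    }
}
```

Under these assumptions, a list that is mostly RESERVED (the busy first pass) is cheaper to scan RESERVED-first, and a list that is mostly FINISHED (later passes) is cheaper to scan FINISHED-first, which is the trade-off the comment describes.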
 
                

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436083#comment-13436083 ] 

Eli Reisman commented on GIRAPH-301:
------------------------------------

Please do. I agree with your concerns about scale and ZK; it's worth really testing at scale before we can feel good about a change like this.
                

[jira] [Updated] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated GIRAPH-301:
-------------------------------

    Attachment: GIRAPH-301-1.patch
    

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444543#comment-13444543 ] 

Eli Reisman commented on GIRAPH-301:
------------------------------------

Crap. I got cluster access again and ran it. There seems to be a small problem in the refactor. Figuring it out now, will post a JIRA and a patch today. Not quite sure why, but I think it breaks the input superstep (the iterator is getting a bad array index sometimes!)


                

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437616#comment-13437616 ] 

Eli Reisman commented on GIRAPH-301:
------------------------------------

GIRAPH-301-5 has been tested at the largest scale and spreads the work as evenly as any solution I have tried here yet. It also speeds up the INPUT_SUPERSTEP by avoiding slow ZK read-iterations for each worker down its split list: each worker is set up to find at least its first unclaimed split with minimal contention, without sacrificing locality where a worker has it available, at least on the first split it reads. In our use case here, where we want to spread the work among many memory-constrained workers, this is the ideal case to avoid overloads and crashes. Very good complement to the original locality patch, probably should have been a feature of that one :)
 
                

[jira] [Updated] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated GIRAPH-301:
-------------------------------

    Attachment: GIRAPH-301-2.patch

Much simpler, does the same thing. Still going to be more ZK calls, but so far not a problem. More testing on the way...

                

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438838#comment-13438838 ] 

Eli Reisman commented on GIRAPH-301:
------------------------------------

After more runs (and some thinking) I realized why I'm seeing some of these symptoms and why this speeds up the INPUT_SUPERSTEP so much. Please advise/correct if I'm wrong here. The idea is:

The reason Locality and InputSplit reads are so fast is that the master writes the znodes once, and all other transactions are reads by workers. This is where ZK shines.

The reason the reads on RESERVED and FINISHED lists are so slow is that those lists are also being concurrently written by workers all the time. This means the quorum must re-sync all the time, and all the readers (every worker) are dragging down the list as they read, every iteration.

Avoiding having the workers read FINISHED nodes to decide when to bail out of the reserveInputSplit() cycles and simply await the next superstep has two advantages. One, workers bail out of read cycles when all splits are RESERVED, so they get out of the loop sooner. In our current code, if a worker fails reading a split, the whole job goes down anyway. By avoiding excessive cycling, fewer readers crawl the list and the speed of that crawl improves. Even in a scenario where workers could restart, only the last one to scan the RESERVED list could fail and not have someone else find that split re-opened and claim it.

My logs show many workers looping on the RESERVED list, and doing extra loops until in fact all RESERVED splits are also marked on the FINISHED list (often several more loops which are slooooow) and finally dropping off when they find all FINISHED. The mystery: then they STILL wait longer, sometimes much longer, while other workers continue to iterate the FINISHED list to be sure themselves. So why is this happening? All nodes are FINISHED, the master should signal the barrier and off we go, right?

No. The master is ALSO still iterating, and is slow since most workers are still jostling in line to iterate the FINISHED list too. By having workers never read this list, you still have a lot of syncing going on as each worker marks its read splits FINISHED, but only ONE reader ever sees it -- the master! THIS is where I think a lot of the "end of the INPUT_SUPERSTEP" speedup is happening.

The "beginning of the INPUT_SUPERSTEP" speedup is mostly due to the code in 301-5, which is also in this 301-6 patch, that simply places each worker at a different index as GIRAPH-250 did, but with locality also maintained. This has been logged to ensure that on the first split claimed by any worker, if you choose at command-line a 1-to-1 ratio of workers to splits, everyone gets their split claimed with only 1-2 reads to ZK, and then each worker does one last unavoidable loop on RESERVED list to see no more splits are available, and simply sleeps at the barrier. Combine that with eliminating the extra loops to check the FINISHED list before dropping out (from this patch), and you have the full speed up.
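
As a rough illustration of the staggered start and the RESERVED-only bail-out described above, here is a small in-memory sketch. It is not the 301-5/301-6 code: the class and method names are hypothetical, the start index is simply the worker id modulo the split count (locality is ignored here), and a boolean array stands in for the RESERVED znodes.

```java
/** Each worker starts its scan at its own offset and only looks at RESERVED flags. */
final class StaggeredStartSketch {
    /**
     * Scans the reserved flags once, starting at an offset derived from the worker id,
     * and claims the first unreserved split. Returns -1 when every split is reserved,
     * at which point the worker would simply wait at the barrier.
     * readCounter[0] is incremented once per simulated ZK read.
     */
    static int claim(boolean[] reserved, int workerId, int[] readCounter) {
        int n = reserved.length;
        if (n == 0) {
            return -1; // no splits at all
        }
        int start = Math.floorMod(workerId, n);
        for (int k = 0; k < n; k++) {
            int i = (start + k) % n; // wrap around the end of the list
            readCounter[0]++;
            if (!reserved[i]) {
                reserved[i] = true; // stands in for creating the reserved znode
                return i;
            }
        }
        return -1; // all splits reserved: stop scanning without reading FINISHED
    }
}
```

With a 1-to-1 ratio of workers to splits, every worker in this model claims its split with a single read, and a second attempt costs one full pass over the RESERVED flags before the worker gives up.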

Does this sound reasonable? It certainly explains what I see on trunk, 301-5, and 301-6 runs in my logs, and speed increases on the same data load and worker #'s over many runs on these 3 versions of Giraph. Any other ideas? 

More important: am I missing anything critical I just didn't tease out in testing as far as barrier dangers with this modification? I have searched the code for everywhere FINISHED and RESERVED znodes are messed with, and it looks good to me.

                

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Alessandro Presta (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442426#comment-13442426 ] 

Alessandro Presta commented on GIRAPH-301:
------------------------------------------

Hey Eli, good news on the testing.

Since this has gone a long way from the initial attempt (continue iteration instead of going to sleep), let me make sure I know what's going on:
1) LocalityInfoSorter rearranges the input split list so that local splits come first, and now also determines the first non-local split based on hashing the worker info.
2) When iterating over the split list, workers only check for RESERVED state instead of FINISHED.
Please correct me/expand upon this.

The code looks good. As an advocate of encapsulation, I would maybe refactor it so that LocalityInfoSorter directly provides an iterable over the splits in the correct order, instead of separately exposing the underlying list and getAdjustedIndex(), but it's just a thought.
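
One way to picture the encapsulation suggested here is a small class that hands out the splits already in the order a worker should try them. This is only a sketch under assumptions: the class name and constructor are hypothetical (not the LocalityInfoSorter API), splits are plain strings, and the non-local starting point is taken from the hash of a worker name.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Yields local splits first, then non-local splits starting at a hashed offset. */
final class SplitOrderSketch implements Iterable<String> {
    private final List<String> ordered = new ArrayList<>();

    SplitOrderSketch(List<String> localSplits, List<String> remoteSplits, String workerName) {
        ordered.addAll(localSplits); // local splits come first
        int n = remoteSplits.size();
        if (n > 0) {
            // Different workers usually start at different non-local splits.
            int start = Math.floorMod(workerName.hashCode(), n);
            for (int k = 0; k < n; k++) {
                ordered.add(remoteSplits.get((start + k) % n));
            }
        }
    }

    @Override
    public Iterator<String> iterator() {
        return ordered.iterator();
    }
}
```

A caller would then just write a for-each loop over the object and never see the underlying list or an adjusted index.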
                
> InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch, GIRAPH-301-3.patch, GIRAPH-301-4.patch, GIRAPH-301-5.patch, GIRAPH-301-6.patch, GIRAPH-301-7.patch
>
>
> With recent additions to the codebase, users here have noticed many workers are able to load input splits extremely quickly, and this has altered the behavior of Giraph during INPUT_SUPERSTEP when using the current algorithm for split reservations. A few workers process multiple splits (often overwhelming Netty and getting GC errors as they attempt to offload too much data too quick) while many (often most) of the others just sleep through the superstep, never successfully participating at all.
> Essentially, the current algo is:
> 1. scan input split list, skipping nodes that are marked "Finished"
> 2. grab the first unfinished node in the list (reserved or not) and check its reserved status.
> 3. if not reserved, attempt to reserve & return it if successful.
> 4. if the first one you check is already taken, sleep for way too long and only wake up if another worker finishes a split, then contend with that worker for another split, while the majority of the split list might sit idle, not actually checked or claimed by anyone yet.
> This does not work. By making a few simple changes (and acknowledging that ZK reads are cheap, only writes are not) this patch is able to get every worker involved, and keep them in the game, ensuring that the INPUT_SUPERSTEP passes quickly and painlessly, and without overwhelming Netty by spreading the memory load the split readers bear more evenly. If the giraph.splitmb and -w options are set correctly, behavior is now exactly as one would expect it to be.
> This also lets INPUT_SUPERSTEP pass more quickly, and lets jobs survive the INPUT_SUPERSTEP for a given data load on fewer Hadoop memory slots.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated GIRAPH-301:
-------------------------------

    Attachment: GIRAPH-301-7.patch

just a quick rebase, no changes.
                
> InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch, GIRAPH-301-3.patch, GIRAPH-301-4.patch, GIRAPH-301-5.patch, GIRAPH-301-6.patch, GIRAPH-301-7.patch
>
>

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Alessandro Presta (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435858#comment-13435858 ] 

Alessandro Presta commented on GIRAPH-301:
------------------------------------------

Cool, I'll wait for the results.
                
> InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch
>
>

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435284#comment-13435284 ] 

Eli Reisman commented on GIRAPH-301:
------------------------------------

Hey Alessandro,

Yes, on point 4 I think that was the idea, and it makes sense. The problem is that if you check only one possibly available node before deciding to sleep, and then start reading again from index 0 of the split list, you contend with other workers whenever one of them finishes a node and wakes you up to keep iterating. One worker gets the first open slot; the others fail to claim it and go back to sleep instead of continuing to iterate.

Worse, when the splits are big and take a long time to read, the sleeping workers time out every minute or so and jump back in to attempt to claim a node. Awakened workers iterate again, and since many other workers are already reading big splits, the first split they encounter is RESERVED, they fail to claim it, and back to sleep they go. So you're back to this problem of everyone (including workers who finish a split and try to iterate for a new one) going back to sleep way too eagerly. I have seen this behavior no matter how I set splitmb and -w since I started using Giraph, and I have been puzzled why I couldn't trick some (often many) workers into doing something when there was enough work to go around.
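The difference between the two behaviors can be shown with a toy, single-threaded model. This is only a sketch under simplifying assumptions: no split finishes during the pass, workers take turns one at a time, and a split index in a set stands in for a reservation node. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the first reservation pass; not Giraph code. */
final class FirstPassSketch {

  /** Old behavior: look only at the first unfinished split, sleep if taken. */
  static int busyWorkersCheckFirstOnly(int workers, int splits) {
    Set<Integer> reserved = new HashSet<>();
    int busy = 0;
    for (int w = 0; w < workers; w++) {
      // Every worker looks at split 0; only the first one can reserve it.
      if (splits > 0 && reserved.add(0)) {
        busy++;
      }
      // Otherwise the worker goes to sleep without checking other splits.
    }
    return busy;
  }

  /** Patched idea: keep iterating until an unreserved split is found. */
  static int busyWorkersKeepIterating(int workers, int splits) {
    Set<Integer> reserved = new HashSet<>();
    int busy = 0;
    for (int w = 0; w < workers; w++) {
      for (int s = 0; s < splits; s++) {
        if (reserved.add(s)) { // add() returns true only for a new entry
          busy++;
          break;
        }
      }
    }
    return busy;
  }
}
```

In this model, 50 workers and 100 splits give 1 busy worker with the check-first-then-sleep rule and 50 busy workers when each worker keeps iterating.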

Users here started emailing about this clumping effect, and I had noticed it many times over the last few months. The situation I describe above is with the new locality patch making some workers read very fast (and overload trying to send out all the data as they pick up new splits like crazy) but this clumping of split-reading activity and groups of workers sleeping through the whole input phase has been happening as long as I've been using Giraph.

My cluster is down this morning for upgrades, but I hope to be back up and running this afternoon/tonight. The tests I ran before putting the patch up worked well: I could get just the behavior I had always expected by doing

 (# of MB of data) / (giraph.splitmb) == (# of workers you should see busy right away reading splits, if you select that many or more with -w) 

Which is, 1 split per worker right from the get-go. Other manipulations of the formula obviously split out the way one would expect when skewing in favor of extra splits or extra workers (i.e. no clumping when 50 workers, 100 splits -- almost all read 2 splits, not some reading 3-4 and some reading 0 like before)
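As a worked version of the formula above, here is a small sketch of the arithmetic. It assumes the split count is simply the data size divided by giraph.splitmb, rounded up; in a real job the InputFormat decides the actual number of splits, so treat the rounding and the helper names as assumptions.

```java
/** Back-of-the-envelope arithmetic for splits and workers; not Giraph code. */
final class SplitMathSketch {

  /** Number of splits, assuming ceiling(totalMb / splitMb). */
  static long splitCount(long totalMb, long splitMb) {
    if (splitMb <= 0) {
      throw new IllegalArgumentException("splitMb must be positive");
    }
    return (totalMb + splitMb - 1) / splitMb;
  }

  /** Workers expected to be reading right away: one split per worker. */
  static long busyWorkers(long totalMb, long splitMb, long workers) {
    return Math.min(workers, splitCount(totalMb, splitMb));
  }
}
```

For example, 6400 MB of input with giraph.splitmb=64 gives 100 splits; with -w 50 all 50 workers should be busy at once and end up reading about 2 splits each, while with -w 100 each worker should read 1.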

So it comes down to your first point: is it bad to load up the ZooKeeper quorum with potentially many reads like this? After reading both ZK papers and having this problem to think about when I added the locality patch, my opinion is "no", this is what ZK is absolutely designed for. Having a quorum of ZK servers to split the read requests definitely helps, but on most clusters this is a minimum of 3 servers. This does bear more testing, of course.

The time when slowdowns or problems can happen is during writes. This patch tries to at least mitigate that a bit by not bothering to try to create the claim node unless we have a hint that the node is not already created. This will not be useful on the first pass when everyone is vying for nodes, but after any awakening from sleep it is quite likely to help since, as of the locality patch, many workers' split lists are not ordered the same any more and they may not encounter the same unclaimed nodes right away as they iterate.
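The read-before-write hint can be sketched as follows. This is not the patch itself: a ConcurrentHashMap stands in for the reservation znodes, containsKey plays the role of the cheap ZooKeeper existence check, and putIfAbsent plays the role of the atomic node creation. The class name, method name, and write counter are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** In-memory model of split reservation with a read hint; not Giraph code. */
final class ReserveSketch {
  // Key = split path, value = worker that reserved it (stand-in for znodes).
  private final ConcurrentHashMap<String, String> reserved =
      new ConcurrentHashMap<>();
  // Counts attempted "writes" so the effect of the hint can be observed.
  final AtomicInteger writeAttempts = new AtomicInteger();

  /** Returns the reserved split path, or null if every split is taken. */
  String reserveNext(Iterable<String> splitPaths, String worker) {
    for (String path : splitPaths) {
      // Cheap read: if the claim already exists, skip the expensive write
      // and keep iterating instead of going to sleep.
      if (reserved.containsKey(path)) {
        continue;
      }
      writeAttempts.incrementAndGet();
      // Atomic create: succeeds for exactly one worker per path.
      if (reserved.putIfAbsent(path, worker) == null) {
        return path;
      }
      // Lost the race between the read and the write; try the next split.
    }
    return null;
  }
}
```

When four workers take turns over three splits in this model, three of them each reserve one split with a single write apiece, and the fourth finds nothing without issuing any write at all.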

                
> InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch
>
>

[jira] [Updated] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated GIRAPH-301:
-------------------------------

    Attachment: GIRAPH-301-4.patch

This modification (as was suggested for GIRAPH-250) will keep the split contention lower per-worker in situations where many workers share the same host machine. I did not see improvement with this on our clusters because this is not our typical use case, but this should be included as it covers those additional common cases.

                
> InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch, GIRAPH-301-3.patch, GIRAPH-301-4.patch
>
>

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Alessandro Presta (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444320#comment-13444320 ] 

Alessandro Presta commented on GIRAPH-301:
------------------------------------------

As this is basically refactoring, I think we're good. I just committed.
You can iterate more on this design in subsequent issues.
Once again, good job!
                
> InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch, GIRAPH-301-3.patch, GIRAPH-301-4.patch, GIRAPH-301-5.patch, GIRAPH-301-6.patch, GIRAPH-301-7.patch, GIRAPH-301-8.patch
>
>

[jira] [Updated] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated GIRAPH-301:
-------------------------------

    Attachment: GIRAPH-301-3.patch

I think this approach might bear more fruit than the others, as it worked before in 250 and I probably should have incorporated it into the locality patch already. If this reduces the # of reads required for a worker to find an unclaimed split on the first round of iterations, the clumping problem should be solved, and the ZK writes that begin to pile up at scale will not slow down the reads so much that so many workers never make their way through the whole list. I'll report back as to the success or failure of the tests by Monday.

                
> InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch, GIRAPH-301-3.patch
>
>

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Alessandro Presta (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434961#comment-13434961 ] 

Alessandro Presta commented on GIRAPH-301:
------------------------------------------

This sounds good. Can you maybe post some results from actual runs? Number of splits read by each worker and overall length of input superstep are the two obvious metrics.

Regarding 4, I wonder why we would want to sleep after the first failed reservation. Do overly frequent attempts overload ZooKeeper?
                
> InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch
>
>

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437138#comment-13437138 ] 

Eli Reisman commented on GIRAPH-301:
------------------------------------

I want to take another whack at this. The clumping is not as bad as it is without the patch, but there is no reason workers should sleep after iterating when any splits are available, and as I scale it up it still happens more than I would like. I considered scrambling a worker's input split list when that worker detects no locality in it, to mix up the order of iteration, but I don't want non-local workers getting the jump on local ones by accident because of this. It would take the pressure off ZK (although even in large-scale tests, ZK seemed fine with the extra work here).

I'm going to play with some approaches and see if anything helps. Even with the trunk code this should not be happening so much.

                
> InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch
>
>

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437619#comment-13437619 ] 

Eli Reisman commented on GIRAPH-301:
------------------------------------

By the way, this 301-6 patch is identical to the 301-5 patch except for one change: the condition for ending cycles of input split reads is now that every path on the znode list is marked RESERVED, rather than also checking for FINISHED marks.
                

[jira] [Updated] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated GIRAPH-301:
-------------------------------

    Attachment: GIRAPH-301-8.patch

Here's the new one with the Iterator. 

It passes all mvn verify tests etc. But I have not been able to run a job on the cluster due to issues here and I would have liked to make certain nothing important changed.

As far as I know, the changes do not touch anything that would alter the patch's behavior (the input split RESERVED/FINISHED logic is untouched from the previous patches).

If this is good enough for you, then we're good to go. If you'd rather wait until I get one good run in on the cluster, I can do that too. Thanks again!


                

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440509#comment-13440509 ] 

Eli Reisman commented on GIRAPH-301:
------------------------------------

Update: I've run this close to 40 times to completion on 2 different clusters with greatly varying numbers of workers, data loads etc., and I'm satisfied it's safe and effective at this point. Take your time in reviewing; fresh eyes will be useful here. But the news on the testing is that it went very well: the input superstep is by far the fastest part of any job I run with this patch in place, and it scales very well.


                

[jira] [Updated] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated GIRAPH-301:
-------------------------------

    Attachment: GIRAPH-301-6.patch

This patch has been well tested up to the largest scales we use here, and functions even better than 301-5. I feel I should explain it, as its methods might be a bit controversial.

A large data load under our constraints here took a certain 4 figure number of workers and over 75 minutes without locality. With locality, this was reduced to 20 minutes, with hundreds fewer workers required. With 301-5, this is lowered to 15 minutes and 400 fewer workers. Using this patch, the same data load-in takes under 4 minutes, with 400 fewer workers than the original job.

The controversial part is this: as stated in earlier posts on this thread, instrumented runs while experimenting with scale this weekend have revealed that even speeding up data load in using locality and other changes in the 301 patches does not end the INPUT_SUPERSTEP as soon as it could, or completely eliminate the "clumping" effect described above.

The reason for this clumping turns out to be that, while ZK can handle large read throughput, the quorum must sync after writes before servicing a backlog of many concurrent reads. Since both the FINISHED and RESERVED znode lists are being queried in all iterations on every worker, and also being mutated as splits are claimed and completed, the workers that never get a split are not sleeping throughout the input step, but are in fact very, VERY slowly iterating over their input split list. In some cases, the step ends before they have finished one single iteration, even if the input superstep goes on for 30 or more minutes!

This patch (301-6) dramatically speeds this up by removing the checks for FINISHED znodes. The nodes are still created whenever a split is finished by a worker, so that the master knows when to end the barrier and begin the first calculation superstep. There is no danger of BSP barriers being tampered with. Further, every worker must read the whole list of splits at least once from the top and observe every node as RESERVED before it stops trying to read any additional splits. Therefore, if a worker dies in mid-read, its ephemeral RESERVED node disappears, and others could possibly claim it, since every worker must still do one full iteration on the list finding all splits RESERVED before ending its search for good and waiting on the barrier for superstep 0.
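The RESERVED-only termination rule described above can be sketched as follows. This is an illustration under stated assumptions, not the code in BspServiceWorker#reserveInputSplit(): the class ReservedOnlyTermination, its method names, and the map standing in for ephemeral RESERVED znodes are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of ending the split search on "all RESERVED"; not Giraph code. */
public class ReservedOnlyTermination {
  // split path -> owning worker. Entries vanish when the owner dies, which
  // stands in for ZooKeeper deleting that worker's ephemeral RESERVED znodes.
  private final Map<String, String> reserved = new ConcurrentHashMap<>();

  /**
   * One full pass over the list. Returns a newly claimed split, or null when
   * every split was seen as RESERVED, at which point the worker stops
   * searching and waits on the barrier. FINISHED marks are never consulted.
   */
  public String nextSplitOrDone(List<String> splits, String worker) {
    for (String split : splits) {
      // In this sketch one atomic call plays the role of "check, then create".
      if (reserved.putIfAbsent(split, worker) == null) {
        return split;
      }
    }
    return null;
  }

  /** Simulates the ephemeral reservations of a dead worker disappearing. */
  public void workerDied(String worker) {
    reserved.values().removeIf(worker::equals);
  }
}
```

If a worker dies before finishing, its split shows up as unreserved again, so any worker that has not yet completed a full all-RESERVED pass can still pick it up.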

This means that the only danger of data loss would be if the very last worker to iterate fails during a split read. In this case, the next superstep will never come (as the split is never marked FINISHED) and the job fails anyway. If a worker dies in Giraph after marking a split FINISHED, there is currently no algorithm in place to restore order to the calculation, even if the worker could restart and recover, so the changes here do no additional harm in this common failure case.

In actual fact, any worker failing and restarting during the INPUT_SUPERSTEP currently causes cascading failure of the job. Until we have a more comprehensive plan for worker failure of this sort, this large optimization of network load and speed during the input superstep poses no danger. It comes from having the workers decide whether to keep iterating on the input list based on every split being RESERVED rather than FINISHED. I have added comments to BspServiceWorker#reserveInputSplit(), where the changes are coded, noting that this optimization should be revisited if the recovery story for Giraph changes in the future.

Again, I have run this to happy completion many times today and can vouch that it causes no problems for Giraph as-is. If everyone is comfortable with this change, I think the reduced cost to network (literally cuts ZK reads from all workers during input phase in half) and the reduced time to finish the superstep are well worth it.


                

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437194#comment-13437194 ] 

Eli Reisman commented on GIRAPH-301:
------------------------------------

With all the writes ZK does to create FINISHED and RESERVED nodes during the split claiming process, some of the readers are getting stuck behind lots of ZK syncs and are moving very slowly down their split lists, if my logs are to be believed. Many are not making it to the bottom by the time all the splits are read or (worse yet) by the time one or more workers have read too many and become overloaded, causing job failure. This is with 4 figures of workers, so you may not see behavior like this with 50-100; I did not. But we are trying for lots of workers to spread the memory load out here. I have a different approach in mind that does not call ZK but instead tries to set the worker up to only have to check a few nodes before successfully claiming one, without losing locality. Will have a patch up soon, and will test this weekend. It will re-establish the hashing from 250, as this really did seem to spread the work out more evenly and without so many iterations on the list per worker. We'll see what happens...

                

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443671#comment-13443671 ] 

Eli Reisman commented on GIRAPH-301:
------------------------------------

That's it exactly. The best approach after lots of versions was the 301-6 (now rebased to 301-7), which hashes where the worker will start iterating to distribute the workers across the list better, but places their local blocks at "adjusted index 0" if there are any. There is still a small chance that a worker who did not find any local blocks of his/her own will hash to start iterating at a block that another worker finds to be local. But from the load-in speeds I have gotten over many, many runs over the last week or so, combined with the instrumented runs I did, everyone seems to get a split that is local if possible (since usually at least 2-3 candidates exist on any one for a worker to try to claim), and everyone regardless of locality seems to claim a split within 1-3 tests on the split list, which gets us through the input stage much faster.
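The ordering idea above (local blocks first, then the rest of the list starting from a hashed offset) might look roughly like this. It is a sketch under assumptions, not the patch itself: the class SplitOrderSketch, the method iterationOrder, and the use of a plain worker id modulo the list size as the "hash" are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a per-worker split iteration order; not Giraph code. */
public class SplitOrderSketch {
  /**
   * Builds the order in which one worker tries to claim splits: splits whose
   * blocks are local to the worker come first ("adjusted index 0"), followed
   * by the remaining splits, starting at an offset derived from the worker id
   * and wrapping around the list.
   */
  public static List<String> iterationOrder(
      List<String> splits, Set<String> localSplits, int workerId) {
    List<String> order = new ArrayList<>();
    for (String split : splits) {
      if (localSplits.contains(split)) {
        order.add(split); // local splits are tried before anything else
      }
    }
    int n = splits.size();
    if (n == 0) {
      return order;
    }
    // Assumed hash: worker id modulo list size spreads the start positions.
    int start = Math.floorMod(workerId, n);
    for (int i = 0; i < n; i++) {
      String split = splits.get((start + i) % n);
      if (!localSplits.contains(split)) {
        order.add(split);
      }
    }
    return order;
  }
}
```

Because different workers start at different offsets, they rarely test the same split first, which is consistent with the report that most workers claim a split within a few attempts.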

When you try to match splits to workers 1-to-1, often every worker gets a split; occasionally a few will still end up running the list and finding all reserved. This seemed to end up with the fewest wasted workers of anything I tried.

As for the RESERVED change, the comment from 21/Aug/12 17:32 is still exactly what happens. While I can't confirm I am right about why it works (although I think those comments sum it up), I can say I have tried a lot of weird # of splits / # of workers combos on this patch and it has never had trouble. So I think we're good there. I did add comments (and can step them up if you like) so that anyone trying to implement a recovery plan for failed input reader workers knows to revisit this in the future. But for now I think we're safe, and this really cuts down the "dead time" when all splits are read and a bunch of workers took forever to figure out the superstep was already effectively over.

The iterator is a great idea, I can do that right now...patch up in a minute...


                

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Alessandro Presta (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438950#comment-13438950 ] 

Alessandro Presta commented on GIRAPH-301:
------------------------------------------

Sorry I haven't gotten around to reviewing this yet. Will check out the latest patch in the next two days, possibly tomorrow.
                

[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Alessandro Presta (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436630#comment-13436630 ] 

Alessandro Presta commented on GIRAPH-301:
------------------------------------------

Sure, let me know when you've done enough testing, and if nobody objects I'll commit.
                
> InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch
>
>
> With recent additions to the codebase, users here have noticed many workers are able to load input splits extremely quickly, and this has altered the behavior of Giraph during INPUT_SUPERSTEP when using the current algorithm for split reservations. A few workers process multiple splits (often overwhelming Netty and getting GC errors as they attempt to offload too much data too quickly) while many (often most) of the others just sleep through the superstep, never successfully participating at all.
> Essentially, the current algo is:
> 1. scan input split list, skipping nodes that are marked "Finished"
> 2. grab the first unfinished node in the list (reserved or not) and check its reserved status.
> 3. if not reserved, attempt to reserve & return it if successful.
> 4. if the first one you check is already taken, sleep for way too long and only wake up if another worker finishes a split, then contend with that worker for another split, while the majority of the split list might sit idle, not actually checked or claimed by anyone yet.
> This does not work. By making a few simple changes (and acknowledging that ZK reads are cheap, only writes are not) this patch is able to get every worker involved and keep them in the game, ensuring that the INPUT_SUPERSTEP passes quickly and painlessly, without overwhelming Netty, by spreading the memory load the split readers bear more evenly. If the giraph.splitmb and -w options are set correctly, behavior is now exactly as one would expect it to be.
> This also results in INPUT_SUPERSTEP passing more quickly, and lets jobs survive the INPUT_SUPERSTEP for a given data load on fewer Hadoop memory slots.
>  
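
The four steps quoted above can be sketched as a toy in-memory model to show why workers end up asleep. This is only an illustration under stated assumptions: the State enum, the method names, and the plain list standing in for the ZooKeeper split nodes are all hypothetical, and the second method is one possible reading of "ZK reads are cheap, only writes are not", not the logic of the attached patches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of split reservation. No real ZooKeeper calls are made. */
class SplitReservationSketch {
  /** Hypothetical split states; the real code tracks these in ZooKeeper. */
  enum State { FREE, RESERVED, FINISHED }

  /**
   * The algorithm as described in the issue: only the first unfinished
   * split is checked. Returns the reserved index, or -1 when the worker
   * would go to sleep.
   */
  static int reserveFirstUnfinished(List<State> splits) {
    for (int i = 0; i < splits.size(); i++) {
      if (splits.get(i) == State.FINISHED) {
        continue; // step 1: skip finished splits
      }
      if (splits.get(i) == State.FREE) {
        splits.set(i, State.RESERVED); // steps 2-3: reserve and return
        return i;
      }
      return -1; // step 4: first unfinished split is taken, stop looking
    }
    return -1; // nothing left to reserve
  }

  /**
   * Assumed alternative (not taken from the patch): keep reading past
   * reserved splits and attempt the write only on one that looks free.
   */
  static int reserveAnyFree(List<State> splits) {
    for (int i = 0; i < splits.size(); i++) {
      if (splits.get(i) == State.FREE) {
        splits.set(i, State.RESERVED);
        return i;
      }
    }
    return -1; // every split is reserved or finished
  }

  public static void main(String[] args) {
    List<State> a = new ArrayList<>(Arrays.asList(
        State.FINISHED, State.RESERVED, State.FREE, State.FREE));
    List<State> b = new ArrayList<>(a);
    // Same starting list, different outcome for a newly arriving worker.
    System.out.println("first-unfinished only: " + reserveFirstUnfinished(a));
    System.out.println("scan for any free:     " + reserveAnyFree(b));
  }
}
```

With one split finished and the next one reserved, the first method gives up even though two splits are still free, which matches the clumping described in the issue; the second method claims the first free split instead.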


[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Alessandro Presta (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443950#comment-13443950 ] 

Alessandro Presta commented on GIRAPH-301:
------------------------------------------

Thanks for clarifying. Sounds like you've done your fair share of testing, so I'm +1 on this.
I'll just wait for the suggested change and then I'll commit!
                
> InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch, GIRAPH-301-3.patch, GIRAPH-301-4.patch, GIRAPH-301-5.patch, GIRAPH-301-6.patch, GIRAPH-301-7.patch
>
>


[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444187#comment-13444187 ] 

Eli Reisman commented on GIRAPH-301:
------------------------------------

You know, having the iterator, I was tempted to make the changes listed in GIRAPH-307 to encapsulate the input split list read and everything inside the LocalityInfoSorter (and maybe go ahead and rename it once it has taken on that responsibility), but I'll save that for a rebase of that patch after this one is in.

                
> InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch, GIRAPH-301-3.patch, GIRAPH-301-4.patch, GIRAPH-301-5.patch, GIRAPH-301-6.patch, GIRAPH-301-7.patch, GIRAPH-301-8.patch
>
>


[jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Alessandro Presta (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438952#comment-13438952 ] 

Alessandro Presta commented on GIRAPH-301:
------------------------------------------

*few days
                
> InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch, GIRAPH-301-3.patch, GIRAPH-301-4.patch, GIRAPH-301-5.patch, GIRAPH-301-6.patch
>
>


[jira] [Updated] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated GIRAPH-301:
-------------------------------

    Attachment: GIRAPH-301-5.patch

One more quick tweak to make sure locality is maintained.

Will test scale-out even further, but so far all tests are excellent and this method is working much better than the earlier ideas. Contention for splits and extra ZK reads are absolutely minimal, locality is maintained, and the input superstep is even faster.

                
> InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch, GIRAPH-301-3.patch, GIRAPH-301-4.patch, GIRAPH-301-5.patch
>
>
