Posted to dev@giraph.apache.org by "Eli Reisman (JIRA)" <ji...@apache.org> on 2012/08/18 22:12:37 UTC

[jira] [Created] (GIRAPH-307) InputSplit list can be long with many workers (and locality info) and should not be re-created every time a worker calls reserveInputSplit()

Eli Reisman created GIRAPH-307:
----------------------------------

             Summary: InputSplit list can be long with many workers (and locality info) and should not be re-created every time a worker calls reserveInputSplit()
                 Key: GIRAPH-307
                 URL: https://issues.apache.org/jira/browse/GIRAPH-307
             Project: Giraph
          Issue Type: Improvement
          Components: bsp, graph
    Affects Versions: 0.2.0
            Reporter: Eli Reisman
            Assignee: Eli Reisman
            Priority: Minor
             Fix For: 0.2.0


While instrumenting the INPUT_SUPERSTEP and watching various runs, I see the input split list generated every time a worker calls reserveInputSplit is, for all intents and purposes, immutable per job. Therefore, we can save a fair amount of memory by not re-creating the list and re-querying ZooKeeper on each pass to claim another split. Only the reserved and finished children lists are ever mutated during the input phase of the job.
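A minimal sketch of the idea, not the code from the patch: fetch the split path list once and hand back the same read-only copy on every later call. The class name CachedSplitPaths and the Supplier standing in for the ZooKeeper read are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical holder that loads the input split path list at most once. */
final class CachedSplitPaths {
  /** Stands in for the ZooKeeper read that lists the split znodes (assumption). */
  private final Supplier<List<String>> fetcher;
  /** Null until the first call to get(); never replaced afterwards. */
  private List<String> cached;

  CachedSplitPaths(Supplier<List<String>> fetcher) {
    this.fetcher = fetcher;
  }

  /** Returns the split paths, querying the backing store only on first use. */
  synchronized List<String> get() {
    if (cached == null) {
      // Copy and wrap so callers cannot mutate the shared list.
      cached = Collections.unmodifiableList(new ArrayList<>(fetcher.get()));
    }
    return cached;
  }
}
```

With this shape, a worker that claims many splits pays for one list read per job instead of one per claim attempt; the reserved and finished markers would still be checked on every pass.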


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-307) InputSplit list can be long with many workers (and locality info) and should not be re-created every time a worker calls reserveInputSplit()

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447061#comment-13447061 ] 

Eli Reisman commented on GIRAPH-307:
------------------------------------

Going to rebase this now that 301 and 318 are in; will post a patch ASAP.


                

[jira] [Commented] (GIRAPH-307) InputSplit list can be long with many workers (and locality info) and should not be re-created every time a worker calls reserveInputSplit()

Posted by "Maja Kabiljo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470401#comment-13470401 ] 

Maja Kabiljo commented on GIRAPH-307:
-------------------------------------

I see, reading this list is fast compared to the other things happening at that time. Still, if we don't need to read it multiple times, we shouldn't.

Thanks, Eli, +1. Unless somebody has an objection, I'll commit this tonight.
                

[jira] [Commented] (GIRAPH-307) InputSplit list can be long with many workers (and locality info) and should not be re-created every time a worker calls reserveInputSplit()

Posted by "Maja Kabiljo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471052#comment-13471052 ] 

Maja Kabiljo commented on GIRAPH-307:
-------------------------------------

If I see correctly, the build failed for some strange reason that has happened before. What do we do in this case?
                

[jira] [Updated] (GIRAPH-307) InputSplit list can be long with many workers (and locality info) and should not be re-created every time a worker calls reserveInputSplit()

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated GIRAPH-307:
-------------------------------

    Attachment: GIRAPH-307-1.patch

This also attempts to re-use a single LocalityInfoSorter by making it the repository for the input split list until all splits have been read and the worker returns "null" from reserveInputSplit().

Passes mvn verify; I will test on a cluster ASAP and report back results.
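A rough sketch of that lifecycle, under stated assumptions and not taken from the patch: one long-lived object keeps the split paths (local ones first) and is asked repeatedly for the next split until it answers null. The class name SplitListHolder, the localPaths set, and the tryClaim predicate (standing in for creating the "reserved" marker in ZooKeeper) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical long-lived holder for the input split path list. */
final class SplitListHolder {
  /** Paths ordered once at construction: local splits first, then the rest. */
  private final List<String> ordered = new ArrayList<>();

  SplitListHolder(List<String> allPaths, Set<String> localPaths) {
    for (String path : allPaths) {
      if (localPaths.contains(path)) {
        ordered.add(path);
      }
    }
    for (String path : allPaths) {
      if (!localPaths.contains(path)) {
        ordered.add(path);
      }
    }
  }

  /**
   * Walks the stored list and returns the first path that tryClaim accepts,
   * or null once no split can be claimed any more.
   */
  String reserve(Predicate<String> tryClaim) {
    for (String path : ordered) {
      if (tryClaim.test(path)) {
        return path;
      }
    }
    return null;
  }
}
```

The list is built and ordered once; each reserve() call only re-checks which entries are still claimable, which is the part that actually changes during the input phase.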

                

        

[jira] [Commented] (GIRAPH-307) InputSplit list can be long with many workers (and locality info) and should not be re-created every time a worker calls reserveInputSplit()

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471061#comment-13471061 ] 

Avery Ching commented on GIRAPH-307:
------------------------------------

I just restarted it.  https://builds.apache.org/job/Giraph-trunk-Commit/229/, let's see how it does this time.
                

[jira] [Updated] (GIRAPH-307) InputSplit list can be long with many workers (and locality info) and should not be re-created every time a worker calls reserveInputSplit()

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated GIRAPH-307:
-------------------------------

    Attachment: GIRAPH-307-3.patch

Thanks again, Maja. I rebased this and fixed the test name. It passes mvn verify again now.

It should reduce ZK traffic during the input superstep, but in the brief testing I did it did not trim much time off the input superstep. It's just a small fix, I think. If I recall, it prevents the repeated calls to ZK and the rebuild of the path list on every iteration over the list by all workers, when the list itself never changes.

Thanks again for the review!
                

[jira] [Commented] (GIRAPH-307) InputSplit list can be long with many workers (and locality info) and should not be re-created every time a worker calls reserveInputSplit()

Posted by "Maja Kabiljo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469490#comment-13469490 ] 

Maja Kabiljo commented on GIRAPH-307:
-------------------------------------

Looks good to me. Just one comment: can you please change the name of the test to reflect the class name change?
Did you see any speed improvement because of fewer ZooKeeper reads?
                

[jira] [Commented] (GIRAPH-307) InputSplit list can be long with many workers (and locality info) and should not be re-created every time a worker calls reserveInputSplit()

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471065#comment-13471065 ] 

Avery Ching commented on GIRAPH-307:
------------------------------------

This one passed. https://builds.apache.org/job/Giraph-trunk-Commit/229/
                

[jira] [Updated] (GIRAPH-307) InputSplit list can be long with many workers (and locality info) and should not be re-created every time a worker calls reserveInputSplit()

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated GIRAPH-307:
-------------------------------

    Attachment: GIRAPH-307-2.patch

Rebased this patch to be up to date with trunk as of Sept 17th. Since this patch also gives full responsibility to the LocalityInfoSorter for loading, storing, and iterating on the InputSplit path list, I changed its name to reflect its new level of responsibility and make the object's lifecycle more obvious in the code.

Works, passes mvn verify, etc. Should be ready for review.
                

[jira] [Commented] (GIRAPH-307) InputSplit list can be long with many workers (and locality info) and should not be re-created every time a worker calls reserveInputSplit()

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471060#comment-13471060 ] 

Avery Ching commented on GIRAPH-307:
------------------------------------

In this case, it was a problem on the Hudson side.

https://builds.apache.org/job/Giraph-trunk-Commit/228/console

So log in to Hudson and run it again =).
                