You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@giraph.apache.org by "Avery Ching (JIRA)" <ji...@apache.org> on 2011/08/29 05:07:40 UTC

[jira] [Created] (GIRAPH-11) Improve the graph distribution of Giraph

Improve the graph distribution of Giraph
----------------------------------------

                 Key: GIRAPH-11
                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
             Project: Giraph
          Issue Type: Improvement
            Reporter: Avery Ching
            Assignee: Avery Ching


Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.

Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.

Design goals for the graph distribution:

* Allow vertices to be unordered or unordered
* Ability to repartition
* Select the partitioning scheme based on user needs (i.e. hash or range based)
* Ability to provide user-specific hints about partitions

Hash-based partitioning

* Good vertex balancing across ranges for random data
* Bad at vertex id locality

Range-based partitioning

* Good at vertex id locality
* Ability to split ranges easily
* Can cause hotspots for hot ranges


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Hyunsik Choi (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150110#comment-13150110 ] 

Hyunsik Choi commented on GIRAPH-11:
------------------------------------

I also think that it's great at this moment.

+1
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Jakob Homan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150095#comment-13150095 ] 

Jakob Homan commented on GIRAPH-11:
-----------------------------------

I think this is ready to go.  Avery, just out of curiosity, beyond the MR unittests, have you run any test vertices on this?  
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149902#comment-13149902 ] 

jiraposter@reviews.apache.org commented on GIRAPH-11:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2788/#review3211
-----------------------------------------------------------


Overall I like it.  Please avoid non-essential format changes in large patches; when reviewing it's like trying to run a marathon with a pebble in your shoe.  There needs to be quite a bit of unit test coverage on the new classes.  Most of them should be amenable to straight-up unit tests rather than ZK-involved integration tests.


http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceWorker.java
<https://reviews.apache.org/r/2788/#comment7131>

    return type has changed. javadoc needs updated.



http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleMutateGraphVertex.java
<https://reviews.apache.org/r/2788/#comment7152>

    switch statement?



http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SuperstepHashPartitioner.java
<https://reviews.apache.org/r/2788/#comment7156>

    This seems like a dangerous thing to leave lying around, even for example purposes.  Is there another example that we can generate which might be more useful?



http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/VerifyMessage.java
<https://reviews.apache.org/r/2788/#comment7158>

    indent +4



http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/VerifyMessage.java
<https://reviews.apache.org/r/2788/#comment7159>

    need debugging guards here and +2 lines



http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/VerifyMessage.java
<https://reviews.apache.org/r/2788/#comment7161>

    log guard



http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/VerifyMessage.java
<https://reviews.apache.org/r/2788/#comment7162>

    ditto



http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
<https://reviews.apache.org/r/2788/#comment7168>

    typo. send -> sent



http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
<https://reviews.apache.org/r/2788/#comment7173>

    typo. sent -> send



http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/GraphPartitioner.java
<https://reviews.apache.org/r/2788/#comment7178>

    Better to call it a factory?



http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java
<https://reviews.apache.org/r/2788/#comment7180>

    typo: dependant -> dependent



http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionBalancer.java
<https://reviews.apache.org/r/2788/#comment7182>

    rename: value -> totalValue, to be consistent with usage.



http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeWorkerPartitioner.java
<https://reviews.apache.org/r/2788/#comment7185>

    I'm unclear on this.


- Jakob


On 2011-11-14 06:56:19, Avery Ching wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2788/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-11-14 06:56:19)
bq.  
bq.  
bq.  Review request for giraph.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Warning: This is a very large change!
bq.  
bq.  Vertex ranges no longer exist.  A generic partitioner handles the
bq.  division of vertex ids to partitions.  As a default, there is a
bq.  HashPartitioner and a HashRangePartitioner that will use the hashCode
bq.  of a Java object to decide which partition to place the vertex.
bq.  Developers can write their own algorithm to determine how to change
bq.  the partitioning as well as implement the assignment of partitions to
bq.  workers.  All vertices loaded from the input split are sent to the
bq.  owner of the partition rather than loaded locally.  This eliminates the
bq.  constraint that the vertices must be ordered in the input split.
bq.  
bq.  The checkpoint format has been changed to suit the new partition
bq.  style.  Checkpoints are now a lot simpler.  The master will assign
bq.  partitions and the workers will only load their own partitions from
bq.  the checkpoint.
bq.  
bq.  Unfortunately, the vertex range implementation was baked into almost
bq.  every aspect of the code (hence the ridiculous size of this diff).
bq.  But now it should be flexible to support several different graph
bq.  partitioning schemes (i.e. hash-based, hash-ranged-based, and for
bq.  special cases, fully ranged-based).
bq.  
bq.  Sorry for the long delay, but this way pretty involved.
bq.  
bq.  
bq.  This addresses bug GIRAPH-11.
bq.      https://issues.apache.org/jira/browse/GIRAPH-11
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/pom.xml 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/benchmark/PseudoRandomVertexInputFormat.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceWorker.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/CommunicationsInterface.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/RPCCommunications.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/ServerInterface.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerCommunications.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexInputFormat.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexReader.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/MaxAggregator.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/MinAggregator.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleMutateGraphVertex.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleSuperstepVertex.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SuperstepBalancer.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SuperstepHashPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/VerifyMessage.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/AutoBalancer.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertex.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertexRangeBalancer.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspService.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspUtils.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GlobalStats.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphMapper.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphState.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/LongDoubleFloatDoubleVertex.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/MutableVertex.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/StaticBalancer.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/Vertex.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexEdgeCount.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexRange.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexRangeBalancer.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/BasicPartitionOwner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/GraphPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashMasterPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashRangePartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashRangeWorkerPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/MasterGraphPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/Partition.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionBalancer.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionExchange.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionOwner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionStats.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionUtils.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeMasterPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitionOwner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitionStats.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeSplitHint.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeWorkerPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/WorkerGraphPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/utils/WritableUtils.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/zk/ZooKeeperExt.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestMutateGraphVertex.java 1201607 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestVertexRangeBalancer.java 1186590 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestVertexRangeBalancer.java 1201607 
bq.  
bq.  Diff: https://reviews.apache.org/r/2788/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  local and MR unittests.  Added some simple unittests for testing the out-of-order input splits and other balancing algorithms.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Avery
bq.  
bq.


                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Hyunsik Choi (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149412#comment-13149412 ] 

Hyunsik Choi commented on GIRAPH-11:
------------------------------------

Avery, 

I'm sorry for delaying the review. Now, I'm digging your patch. 
That looks great! Based on this work, we can consider some advanced graph partitioner based on the number of edge-cuts on graph partitions.

I need about one more day for more investigation because the patch is somewhat complicated for me :) 

Besides, for the deeper review, I would like to execute the some tests and trace them. Your patch needs the rebase. Could you rebase the patch?

Thank you :)
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Hyunsik Choi (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149485#comment-13149485 ] 

Hyunsik Choi commented on GIRAPH-11:
------------------------------------

You are welcome. But, the second patch still occurs the following error:

{code}
hyunsik@code:~/Code/giraph/giraph-trunk$ patch -p0 < ~/Downloads/GIRAPH-11.2.diff patching file pom.xml
patching file src/main/java/org/apache/giraph/benchmark/PseudoRandomVertexInputFormat.java
patching file src/main/java/org/apache/giraph/bsp/CentralizedServiceWorker.java
patching file src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java
patching file src/main/java/org/apache/giraph/comm/CommunicationsInterface.java
patching file src/main/java/org/apache/giraph/comm/RPCCommunications.java
patching file src/main/java/org/apache/giraph/comm/ServerInterface.java
patching file src/main/java/org/apache/giraph/comm/WorkerCommunications.java
patching file src/main/java/org/apache/giraph/examples/GeneratedVertexInputFormat.java
patching file src/main/java/org/apache/giraph/examples/GeneratedVertexReader.java
patching file src/main/java/org/apache/giraph/examples/MaxAggregator.java
patching file src/main/java/org/apache/giraph/examples/MinAggregator.java
patching file src/main/java/org/apache/giraph/examples/SimpleMutateGraphVertex.java
patching file src/main/java/org/apache/giraph/examples/SimpleSuperstepVertex.java
patching file src/main/java/org/apache/giraph/examples/SuperstepBalancer.java
patching file src/main/java/org/apache/giraph/examples/SuperstepHashPartitioner.java
patching file src/main/java/org/apache/giraph/examples/VerifyMessage.java
patching file src/main/java/org/apache/giraph/graph/AutoBalancer.java
patching file src/main/java/org/apache/giraph/graph/BasicVertex.java
patching file src/main/java/org/apache/giraph/graph/BasicVertexRangeBalancer.java
patching file src/main/java/org/apache/giraph/graph/BspService.java
patching file src/main/java/org/apache/giraph/graph/BspServiceMaster.java
patching file src/main/java/org/apache/giraph/graph/BspServiceWorker.java
patching file src/main/java/org/apache/giraph/graph/BspUtils.java
patching file src/main/java/org/apache/giraph/graph/GiraphJob.java
patching file src/main/java/org/apache/giraph/graph/GlobalStats.java
patching file src/main/java/org/apache/giraph/graph/GraphMapper.java
patching file src/main/java/org/apache/giraph/graph/GraphState.java
patching file src/main/java/org/apache/giraph/graph/LongDoubleFloatDoubleVertex.java
patching file src/main/java/org/apache/giraph/graph/MutableVertex.java
patching file src/main/java/org/apache/giraph/graph/StaticBalancer.java
patching file src/main/java/org/apache/giraph/graph/Vertex.java
patching file src/main/java/org/apache/giraph/graph/VertexEdgeCount.java
patching file src/main/java/org/apache/giraph/graph/VertexRange.java
patching file src/main/java/org/apache/giraph/graph/VertexRangeBalancer.java
patching file src/main/java/org/apache/giraph/graph/WorkerInfo.java
patching file src/main/java/org/apache/giraph/graph/partition/BasicPartitionOwner.java
patching file src/main/java/org/apache/giraph/graph/partition/GraphPartitioner.java
patching file src/main/java/org/apache/giraph/graph/partition/HashMasterPartitioner.java
patching file src/main/java/org/apache/giraph/graph/partition/HashPartitioner.java
patching file src/main/java/org/apache/giraph/graph/partition/HashRangePartitioner.java
patching file src/main/java/org/apache/giraph/graph/partition/HashRangeWorkerPartitioner.java
patching file src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java
patching file src/main/java/org/apache/giraph/graph/partition/MasterGraphPartitioner.java
patching file src/main/java/org/apache/giraph/graph/partition/Partition.java
patching file src/main/java/org/apache/giraph/graph/partition/PartitionBalancer.java
patching file src/main/java/org/apache/giraph/graph/partition/PartitionExchange.java
patching file src/main/java/org/apache/giraph/graph/partition/PartitionOwner.java
patching file src/main/java/org/apache/giraph/graph/partition/PartitionStats.java
patching file src/main/java/org/apache/giraph/graph/partition/PartitionUtils.java
patching file src/main/java/org/apache/giraph/graph/partition/RangeMasterPartitioner.java
patching file src/main/java/org/apache/giraph/graph/partition/RangePartitionOwner.java
patching file src/main/java/org/apache/giraph/graph/partition/RangePartitionStats.java
patching file src/main/java/org/apache/giraph/graph/partition/RangePartitioner.java
patching file src/main/java/org/apache/giraph/graph/partition/RangeSplitHint.java
patching file src/main/java/org/apache/giraph/graph/partition/RangeWorkerPartitioner.java
patching file src/main/java/org/apache/giraph/graph/partition/WorkerGraphPartitioner.java
patching file src/main/java/org/apache/giraph/utils/WritableUtils.java
patching file src/main/java/org/apache/giraph/zk/ZooKeeperExt.java
patching file src/test/java/org/apache/giraph/TestMutateGraphVertex.java
patching file src/test/java/org/apache/giraph/TestVertexRangeBalancer.java
Hunk #2 FAILED at 52.
Hunk #3 FAILED at 86.
Hunk #4 FAILED at 106.
Hunk #5 succeeded at 131 (offset 6 lines).
3 out of 5 hunks FAILED -- saving rejects to file src/test/java/org/apache/giraph/TestVertexRangeBalancer.java.rej
patching file src/test/java/org/apache/giraph/TestVertexRangeBalancer.java
{code}

It may be caused by the moved file (i.e., TestVertexRangeBalancer.java -> TestGraphPartitioner.java). First of all, I manually merged the two files for review.

Thank you

                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147916#comment-13147916 ] 

Avery Ching commented on GIRAPH-11:
-----------------------------------

Sooo...anyone care to look at this?  I'd really like to get it since the old graph partitioning sucks.  One we have it in and test a bit, hopefully we can cut a release now..
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-11:
------------------------------

    Attachment: GIRAPH-11.3.diff
    
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Severin Corsten (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101682#comment-13101682 ] 

Severin Corsten commented on GIRAPH-11:
---------------------------------------

Can you clarify the difference between hash based and hash range based? Is the difference just partition operator, which is modulo for hash based and a rang function for hash range based?

I like the idea to use the hashCode() function and that the user has control about the used algorithm. I think that a better locality leads to better performance, because messages don't need to be sent over the network and no lookups have to be performed.

> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150005#comment-13150005 ] 

jiraposter@reviews.apache.org commented on GIRAPH-11:
-----------------------------------------------------



bq.  On 2011-11-14 20:54:28, Jakob Homan wrote:
bq.  > Overall I like it.  Please avoid non-essential format changes in large patches; when reviewing it's like trying to run a marathon with a pebble in your shoe.  There needs to be quite a bit of unit test coverage on the new classes.  Most of them should be amenable to straight-up unit tests rather than ZK-involved integration tests.

Since it's a straight switch from VertexRange objects to Partition objects, everything is tested by the same existing integration tests.  I have also added a few more integration tests that ensure reverse ordering, different algorithms, etc.  I agree more unittestting should be done rather than integration testing.  If you don't mind, I'd like to add those in a later issue as this one is already too big and needs to be committed soon.


bq.  On 2011-11-14 20:54:28, Jakob Homan wrote:
bq.  > http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceWorker.java, line 111
bq.  > <https://reviews.apache.org/r/2788/diff/2/?file=57771#file57771line111>
bq.  >
bq.  >     return type has changed. javadoc needs updated.

Fixed.


bq.  On 2011-11-14 20:54:28, Jakob Homan wrote:
bq.  > http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleMutateGraphVertex.java, line 62
bq.  > <https://reviews.apache.org/r/2788/diff/2/?file=57781#file57781line62>
bq.  >
bq.  >     switch statement?

In this case, I prefer the if/else if/else logic due since I have to scope each block, making the switch a bit long with the extra cscoping i.e. case 3: { long...


bq.  On 2011-11-14 20:54:28, Jakob Homan wrote:
bq.  > http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SuperstepHashPartitioner.java, line 49
bq.  > <https://reviews.apache.org/r/2788/diff/2/?file=57784#file57784line49>
bq.  >
bq.  >     This seems like a dangerous thing to leave lying around, even for example purposes.  Is there another example that we can generate which might be more useful?

I have moved it to the test (TestGraphPartitioner), so that people don't just use it.


bq.  On 2011-11-14 20:54:28, Jakob Homan wrote:
bq.  > http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/VerifyMessage.java, line 123
bq.  > <https://reviews.apache.org/r/2788/diff/2/?file=57785#file57785line123>
bq.  >
bq.  >     indent +4

Fixed.


bq.  On 2011-11-14 20:54:28, Jakob Homan wrote:
bq.  > http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/VerifyMessage.java, line 128
bq.  > <https://reviews.apache.org/r/2788/diff/2/?file=57785#file57785line128>
bq.  >
bq.  >     need debugging guards here and +2 lines

Fixed.


bq.  On 2011-11-14 20:54:28, Jakob Homan wrote:
bq.  > http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/VerifyMessage.java, line 157
bq.  > <https://reviews.apache.org/r/2788/diff/2/?file=57785#file57785line157>
bq.  >
bq.  >     log guard

Fixed.


bq.  On 2011-11-14 20:54:28, Jakob Homan wrote:
bq.  > http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/VerifyMessage.java, line 162
bq.  > <https://reviews.apache.org/r/2788/diff/2/?file=57785#file57785line162>
bq.  >
bq.  >     ditto

Fixed.


bq.  On 2011-11-14 20:54:28, Jakob Homan wrote:
bq.  > http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java, line 745
bq.  > <https://reviews.apache.org/r/2788/diff/2/?file=57791#file57791line745>
bq.  >
bq.  >     typo. send -> sent

Fixed.


bq.  On 2011-11-14 20:54:28, Jakob Homan wrote:
bq.  > http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java, line 1579
bq.  > <https://reviews.apache.org/r/2788/diff/2/?file=57791#file57791line1579>
bq.  >
bq.  >     typo. sent -> send

Fixed.  This is sad, especially since English is my only language =).


bq.  On 2011-11-14 20:54:28, Jakob Homan wrote:
bq.  > http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/GraphPartitioner.java, line 33
bq.  > <https://reviews.apache.org/r/2788/diff/2/?file=57806#file57806line33>
bq.  >
bq.  >     Better to call it a factory?

Agreed, changed.


bq.  On 2011-11-14 20:54:28, Jakob Homan wrote:
bq.  > http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java, line 76
bq.  > <https://reviews.apache.org/r/2788/diff/2/?file=57811#file57811line76>
bq.  >
bq.  >     typo: dependant -> dependent

Changed.


bq.  On 2011-11-14 20:54:28, Jakob Homan wrote:
bq.  > http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionBalancer.java, line 123
bq.  > <https://reviews.apache.org/r/2788/diff/2/?file=57814#file57814line123>
bq.  >
bq.  >     rename: value -> totalValue, to be consistent with usage.

Changed.


bq.  On 2011-11-14 20:54:28, Jakob Homan wrote:
bq.  > http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeWorkerPartitioner.java, line 117
bq.  > <https://reviews.apache.org/r/2788/diff/2/?file=57824#file57824line117>
bq.  >
bq.  >     I'm unclear on this.

RangePartitionerFactory unfortunately is abstract, needs implementations of various index types.  A developer can use RangeWorkerPartitioner as something to help them out for their particular implementation.  This is somewhat experimental work, but the idea is that it will allow very very advanced users to customize partiitoning based on a range for their particular index type.  I am making this class abstract with a big notice on what needs to be done if you want to use it.


- Avery


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2788/#review3211
-----------------------------------------------------------


On 2011-11-14 22:24:27, Avery Ching wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2788/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-11-14 22:24:27)
bq.  
bq.  
bq.  Review request for giraph.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Warning: This is a very large change!
bq.  
bq.  Vertex ranges no longer exist.  A generic partitioner handles the
bq.  division of vertex ids to partitions.  As a default, there is a
bq.  HashPartitioner and a HashRangePartitioner that will use the hashCode
bq.  of a Java object to decide which partition to place the vertex.
bq.  Developers can write their own algorithm to determine how to change
bq.  the partitioning as well as implement the assignment of partitions to
bq.  workers.  All vertices loaded from the input split are sent to the
bq.  owner of the partition rather than loaded locally.  This eliminates the
bq.  constraint that the vertices must be ordered in the input split.
bq.  
bq.  The checkpoint format has been changed to suit the new partition
bq.  style.  Checkpoints are now a lot simpler.  The master will assign
bq.  partitions and the workers will only load their own partitions from
bq.  the checkpoint.
bq.  
bq.  Unfortunately, the vertex range implementation was baked into almost
bq.  every aspect of the code (hence the ridiculous size of this diff).
bq.  But now it should be flexible to support several different graph
bq.  partitioning schemes (i.e. hash-based, hash-ranged-based, and for
bq.  special cases, fully ranged-based).
bq.  
bq.  Sorry for the long delay, but this way pretty involved.
bq.  
bq.  
bq.  This addresses bug GIRAPH-11.
bq.      https://issues.apache.org/jira/browse/GIRAPH-11
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionStats.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionUtils.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionBalancer.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionExchange.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionOwner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/Partition.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/MasterGraphPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashRangeWorkerPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashPartitionerFactory.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashRangePartitionerFactory.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/GraphPartitionerFactory.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashMasterPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexRangeBalancer.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/BasicPartitionOwner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/StaticBalancer.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/Vertex.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexEdgeCount.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexRange.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspUtils.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GlobalStats.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphMapper.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphState.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/LongDoubleFloatDoubleVertex.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/MutableVertex.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertex.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertexRangeBalancer.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspService.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/AutoBalancer.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/VerifyMessage.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SuperstepBalancer.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleSuperstepVertex.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleMutateGraphVertex.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexInputFormat.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexReader.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/MaxAggregator.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/MinAggregator.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/ServerInterface.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerCommunications.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/CommunicationsInterface.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/RPCCommunications.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/pom.xml 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/benchmark/PseudoRandomVertexInputFormat.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceWorker.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeMasterPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitionOwner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitionStats.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitionerFactory.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeSplitHint.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeWorkerPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/WorkerGraphPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/utils/WritableUtils.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/zk/ZooKeeperExt.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestGraphPartitioner.java PRE-CREATION 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestMutateGraphVertex.java 1201630 
bq.    http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestVertexRangeBalancer.java 1201630 
bq.  
bq.  Diff: https://reviews.apache.org/r/2788/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  local and MR unittests.  Added some simple unittests for testing the out-of-order input splits and other balancing algorithms.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Avery
bq.  
bq.


                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-11:
------------------------------

    Attachment: GIRAPH-11.diff

Here's the diff for those of you who don't like reviewboard =).
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-11:
------------------------------

    Affects Version/s: 0.70.0

> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-11:
------------------------------

    Attachment: GIRAPH-11.2.diff

A recent diff generated from 'svn diff' from trunk that is also present in reviewboard.  Thanks again in advance for the review Hyunsik!
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-11:
------------------------------

    Attachment: GIRAPH-11.4.diff

The diff used for reviewboard based on Jakob's review.
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150097#comment-13150097 ] 

Avery Ching commented on GIRAPH-11:
-----------------------------------

Ran local tests, MR tests, and ran the page rank benchmark with 100 m vertices.  No doubt we'll need to optimize more, but this needed to get done first.
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098098#comment-13098098 ] 

Avery Ching commented on GIRAPH-11:
-----------------------------------

I'm going to assume you're asking about the current partitioning.  If I'm wrong, I'll address what we plan to do in the future.  The current partitioning is implemented by assuming that the input splits are sorted globally (i.e. two input split of {A, B, C} {D, E}).  It will break the input splits into vertex ranges where the boundaries will not change.  These vertex ranges can be passed around the workers via several different balancers.  The balancer can be set via setVertexRangeBalancerClass() from GiraphJob or with the right configuration parameter (giraph.vertexRangeBalancerClass).  We have some implementations for a static balancer (no vertex movement, default), and an auto balancer (configurable to balance based on vertices or edges).  You're free to implement your own as well.  Hope that answers some of the questions, let me know if you have more.

> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-11:
------------------------------

    Attachment: GIRAPH-11.3.diff

Sorry Hyunsik.  I gave you the reviewboard diff I had to massage one line with.  This one is a fresh svn diff (no modification).
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-11:
------------------------------

    Attachment:     (was: GIRAPH-11.3.diff)
    
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150137#comment-13150137 ] 

Hudson commented on GIRAPH-11:
------------------------------

Integrated in Giraph-trunk-Commit #31 (See [https://builds.apache.org/job/Giraph-trunk-Commit/31/])
    GIRAPH-11: Improve the graph distribution of Giraph. (aching)

aching : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1201987
Files : 
* /incubator/giraph/trunk/CHANGELOG
* /incubator/giraph/trunk/pom.xml
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/benchmark/PseudoRandomVertexInputFormat.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceWorker.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/CommunicationsInterface.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/RPCCommunications.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/ServerInterface.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerCommunications.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexInputFormat.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexReader.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/MaxAggregator.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/MinAggregator.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleMutateGraphVertex.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleSuperstepVertex.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SuperstepBalancer.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/VerifyMessage.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/AutoBalancer.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertex.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertexRangeBalancer.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspService.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspUtils.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GlobalStats.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphMapper.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphState.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/LongDoubleFloatDoubleVertex.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/MutableVertex.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/StaticBalancer.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/Vertex.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexEdgeCount.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexRange.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexRangeBalancer.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/BasicPartitionOwner.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/GraphPartitionerFactory.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashMasterPartitioner.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashPartitionerFactory.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashRangePartitionerFactory.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashRangeWorkerPartitioner.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/MasterGraphPartitioner.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/Partition.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionBalancer.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionExchange.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionOwner.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionStats.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionUtils.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeMasterPartitioner.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitionOwner.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitionStats.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitionerFactory.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeSplitHint.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeWorkerPartitioner.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/WorkerGraphPartitioner.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/utils/WritableUtils.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/zk/ZooKeeperExt.java
* /incubator/giraph/trunk/src/test/java/org/apache/giraph/TestGraphPartitioner.java
* /incubator/giraph/trunk/src/test/java/org/apache/giraph/TestMutateGraphVertex.java
* /incubator/giraph/trunk/src/test/java/org/apache/giraph/TestVertexRangeBalancer.java

                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150006#comment-13150006 ] 

jiraposter@reviews.apache.org commented on GIRAPH-11:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2788/
-----------------------------------------------------------

(Updated 2011-11-14 22:24:27.293676)


Review request for giraph.


Changes
-------

Made changes based on Jakob's review.


Summary
-------

Warning: This is a very large change!

Vertex ranges no longer exist.  A generic partitioner handles the
division of vertex ids to partitions.  As a default, there is a
HashPartitioner and a HashRangePartitioner that will use the hashCode
of a Java object to decide which partition to place the vertex.
Developers can write their own algorithm to determine how to change
the partitioning as well as implement the assignment of partitions to
workers.  All vertices loaded from the input split are sent to the
owner of the partition rather than loaded locally.  This eliminates the
constraint that the vertices must be ordered in the input split.

The checkpoint format has been changed to suit the new partition
style.  Checkpoints are now a lot simpler.  The master will assign
partitions and the workers will only load their own partitions from
the checkpoint.

Unfortunately, the vertex range implementation was baked into almost
every aspect of the code (hence the ridiculous size of this diff).
But now it should be flexible to support several different graph
partitioning schemes (i.e. hash-based, hash-ranged-based, and for
special cases, fully ranged-based).

Sorry for the long delay, but this way pretty involved.


This addresses bug GIRAPH-11.
    https://issues.apache.org/jira/browse/GIRAPH-11


Diffs (updated)
-----

  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionStats.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionUtils.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionBalancer.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionExchange.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionOwner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/Partition.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/MasterGraphPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashRangeWorkerPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashPartitionerFactory.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashRangePartitionerFactory.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/GraphPartitionerFactory.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashMasterPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexRangeBalancer.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/BasicPartitionOwner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/StaticBalancer.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/Vertex.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexEdgeCount.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexRange.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspUtils.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GlobalStats.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphMapper.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphState.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/LongDoubleFloatDoubleVertex.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/MutableVertex.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertex.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertexRangeBalancer.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspService.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/AutoBalancer.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/VerifyMessage.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SuperstepBalancer.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleSuperstepVertex.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleMutateGraphVertex.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexInputFormat.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexReader.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/MaxAggregator.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/MinAggregator.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/ServerInterface.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerCommunications.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/CommunicationsInterface.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/RPCCommunications.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/pom.xml 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/benchmark/PseudoRandomVertexInputFormat.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceWorker.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeMasterPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitionOwner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitionStats.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitionerFactory.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeSplitHint.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeWorkerPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/WorkerGraphPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/utils/WritableUtils.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/zk/ZooKeeperExt.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestGraphPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestMutateGraphVertex.java 1201630 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestVertexRangeBalancer.java 1201630 

Diff: https://reviews.apache.org/r/2788/diff


Testing
-------

local and MR unittests.  Added some simple unittests for testing the out-of-order input splits and other balancing algorithms.


Thanks,

Avery


                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Severin Corsten (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106636#comment-13106636 ] 

Severin Corsten commented on GIRAPH-11:
---------------------------------------

yes, it's much clearer. If I can help you, just give me some task. I am not quite sure to what extent my code has the right quality for Giraph bud I will do my best.


> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Jake Mannix (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147194#comment-13147194 ] 

Jake Mannix commented on GIRAPH-11:
-----------------------------------

wow, there's a monster diff!  I'm glad I don't have any outstanding patches to get clobbered!
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Jakob Homan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150143#comment-13150143 ] 

Jakob Homan commented on GIRAPH-11:
-----------------------------------

ok, I hate to be late to the party on this, but while generating some newbie issues, IntelliJ noticed that in RangeWorkerPartition, the instance variable vertexRangeMap is queried but never updated and, since it's private, can't be accessed by an implementing class.  I believe we need an accessor method for it. 
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147226#comment-13147226 ] 

Avery Ching commented on GIRAPH-11:
-----------------------------------

I started it so long ago that the merge was brutal for me.  =)
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106839#comment-13106839 ] 

Avery Ching commented on GIRAPH-11:
-----------------------------------

Hi Severin, as far as getting started, feel free to take a look at any of the open issues (https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+GIRAPH+AND+status+%3D+Open+ORDER+BY+priority+DESC&mode=hide) that are unassigned.  Also, if you think of any interesting features to add, file a JIRA.  We'd be happy to have you contribute =).

> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149468#comment-13149468 ] 

jiraposter@reviews.apache.org commented on GIRAPH-11:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2788/
-----------------------------------------------------------

(Updated 2011-11-14 06:56:19.251685)


Review request for giraph.


Changes
-------

Updated the diff as per Hyunsik's request to build against recent trunk changes.  While I was waiting I added some fixed and additions as well.

Upgrade ZooKeeper to 3.3.3 from 3.3.1.

Fixed bug in PseudoRandomVertexInputFormat.java where the edges are not fully added (hasEdge is not the right place to look for the edge).

Fixed bug in BasicRPCCommunications when putting to a local inPartitionMap

Added counter for last checkpointed superstep

Master should refresh the progress every 60 seconds while waiting for workers to ensure that the job isn't killed

Fixed bugs in vertexCounter, finishedVertexCoutner, edgeCounter, and sentMessages counter not resetting every update (just cumultatively being added).

Add additional helpful status messages for debugging.

Turned off speculative execution for Giraph (not a good idea).

Added analysis of the partition balancing for debugging


Summary
-------

Warning: This is a very large change!

Vertex ranges no longer exist.  A generic partitioner handles the
division of vertex ids to partitions.  As a default, there is a
HashPartitioner and a HashRangePartitioner that will use the hashCode
of a Java object to decide which partition to place the vertex.
Developers can write their own algorithm to determine how to change
the partitioning as well as implement the assignment of partitions to
workers.  All vertices loaded from the input split are sent to the
owner of the partition rather than loaded locally.  This eliminates the
constraint that the vertices must be ordered in the input split.

The checkpoint format has been changed to suit the new partition
style.  Checkpoints are now a lot simpler.  The master will assign
partitions and the workers will only load their own partitions from
the checkpoint.

Unfortunately, the vertex range implementation was baked into almost
every aspect of the code (hence the ridiculous size of this diff).
But now it should be flexible to support several different graph
partitioning schemes (i.e. hash-based, hash-ranged-based, and for
special cases, fully ranged-based).

Sorry for the long delay, but this way pretty involved.


This addresses bug GIRAPH-11.
    https://issues.apache.org/jira/browse/GIRAPH-11


Diffs (updated)
-----

  http://svn.apache.org/repos/asf/incubator/giraph/trunk/pom.xml 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/benchmark/PseudoRandomVertexInputFormat.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceWorker.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/CommunicationsInterface.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/RPCCommunications.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/ServerInterface.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerCommunications.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexInputFormat.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexReader.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/MaxAggregator.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/MinAggregator.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleMutateGraphVertex.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleSuperstepVertex.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SuperstepBalancer.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SuperstepHashPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/VerifyMessage.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/AutoBalancer.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertex.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertexRangeBalancer.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspService.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspUtils.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GlobalStats.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphMapper.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphState.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/LongDoubleFloatDoubleVertex.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/MutableVertex.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/StaticBalancer.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/Vertex.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexEdgeCount.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexRange.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexRangeBalancer.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/BasicPartitionOwner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/GraphPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashMasterPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashRangePartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashRangeWorkerPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/MasterGraphPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/Partition.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionBalancer.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionExchange.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionOwner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionStats.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionUtils.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeMasterPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitionOwner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitionStats.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeSplitHint.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeWorkerPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/WorkerGraphPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/utils/WritableUtils.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/zk/ZooKeeperExt.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestMutateGraphVertex.java 1201607 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestVertexRangeBalancer.java 1186590 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestVertexRangeBalancer.java 1201607 

Diff: https://reviews.apache.org/r/2788/diff


Testing
-------

local and MR unittests.  Added some simple unittests for testing the out-of-order input splits and other balancing algorithms.


Thanks,

Avery


                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150869#comment-13150869 ] 

Hudson commented on GIRAPH-11:
------------------------------

Integrated in Giraph-trunk-Commit #34 (See [https://builds.apache.org/job/Giraph-trunk-Commit/34/])
    GIRAPH-88: Message count not updated properly after GIRAPH-11. (aching)

aching : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1202455
Files : 
* /incubator/giraph/trunk/CHANGELOG
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceWorker.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphMapper.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeWorkerPartitioner.java

                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Hyunsik Choi (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149471#comment-13149471 ] 

Hyunsik Choi commented on GIRAPH-11:
------------------------------------

Thank you for rebase.
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146944#comment-13146944 ] 

jiraposter@reviews.apache.org commented on GIRAPH-11:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2788/
-----------------------------------------------------------

Review request for giraph.


Summary
-------

Warning: This is a very large change!

Vertex ranges no longer exist.  A generic partitioner handles the
division of vertex ids to partitions.  As a default, there is a
HashPartitioner and a HashRangePartitioner that will use the hashCode
of a Java object to decide which partition to place the vertex.
Developers can write their own algorithm to determine how to change
the partitioning as well as implement the assignment of partitions to
workers.  All vertices loaded from the input split are sent to the
owner of the partition rather than loaded locally.  This eliminates the
constraint that the vertices must be ordered in the input split.

The checkpoint format has been changed to suit the new partition
style.  Checkpoints are now a lot simpler.  The master will assign
partitions and the workers will only load their own partitions from
the checkpoint.

Unfortunately, the vertex range implementation was baked into almost
every aspect of the code (hence the ridiculous size of this diff).
But now it should be flexible to support several different graph
partitioning schemes (i.e. hash-based, hash-ranged-based, and for
special cases, fully ranged-based).

Sorry for the long delay, but this way pretty involved.


This addresses bug GIRAPH-11.
    https://issues.apache.org/jira/browse/GIRAPH-11


Diffs
-----

  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceWorker.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java 1196639 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/CommunicationsInterface.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/RPCCommunications.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/ServerInterface.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerCommunications.java 1196639 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexInputFormat.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexReader.java 1196639 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/MaxAggregator.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/MinAggregator.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleMutateGraphVertex.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleSuperstepVertex.java 1196639 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SuperstepBalancer.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SuperstepHashPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/VerifyMessage.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/AutoBalancer.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertex.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertexRangeBalancer.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspService.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java 1196639 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java 1198972 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspUtils.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GlobalStats.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphMapper.java 1198972 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphState.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/LongDoubleFloatDoubleVertex.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/MutableVertex.java 1196639 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/StaticBalancer.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/Vertex.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexRange.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexRangeBalancer.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/BasicPartitionOwner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/GraphPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashMasterPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashRangePartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashRangeWorkerPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/MasterGraphPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/Partition.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionBalancer.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionExchange.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionOwner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/PartitionStats.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeMasterPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitionOwner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitionStats.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangePartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeSplitHint.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/RangeWorkerPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/WorkerGraphPartitioner.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/utils/WritableUtils.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/zk/ZooKeeperExt.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/site/site.xml 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestMutateGraphVertex.java 1199643 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestVertexRangeBalancer.java 1186590 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestVertexRangeBalancer.java 1199643 

Diff: https://reviews.apache.org/r/2788/diff


Testing
-------

local and MR unittests.  Added some simple unittests for testing the out-of-order input splits and other balancing algorithms.


Thanks,

Avery


                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Jakob Homan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150099#comment-13150099 ] 

Jakob Homan commented on GIRAPH-11:
-----------------------------------

+1
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150213#comment-13150213 ] 

Avery Ching commented on GIRAPH-11:
-----------------------------------

No problem.  I can make vertexRangeMap protected instead of private.  
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching resolved GIRAPH-11.
-------------------------------

    Resolution: Fixed

Thanks for the reviews Jakob and Hyunsik.  Closing now that Hudson also approves. =)
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.2.diff, GIRAPH-11.3.diff, GIRAPH-11.4.diff, GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101593#comment-13101593 ] 

Avery Ching commented on GIRAPH-11:
-----------------------------------

The hash partitioning will be based on hashCode() by default, but the user can implement something they like as well based on the vertex id.  I am designing it to get hash based and hash range based.  In a pure hash-based distribution, you should get great load balancing.  In a hash-range based distribution, the user could possibly get some locality benefits without changing anything from the hash based partitioning.  Then finally, there should be a way for the user to do a pure range based split of the id space, but this requires the most work by the user to specify their division of the id space (depends on the type).

The hash based and hash-range based schemes will be implemented by default and will be selectable by users.  The range based scheme will be a partial implementation since we require users to do the id range partitioning.  Additionally, we will provide the API for users to implement their own graph partitioning scheme.

Let me know what you think.

> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101710#comment-13101710 ] 

Avery Ching commented on GIRAPH-11:
-----------------------------------

Regarding the difference in hash based and hash rang based, it refers to how the hash code is assigned to a partition.  The application dev will implement hashCode() for their vertex id and then the assignment of the hashCode() to a partition can be hashed (i.e. hashCode() % # partitions) or range based ([0-a),[a-b)...etc).  Hope that's more clear.  Code will help.  It's coming soon, by mid next week I hope.

> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Dan Brickley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097840#comment-13097840 ] 

Dan Brickley commented on GIRAPH-11:
------------------------------------

Is there more detail somewhere on how 'range based' works?

> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149022#comment-13149022 ] 

Avery Ching commented on GIRAPH-11:
-----------------------------------

Thank you Hyunsik!  Graph partitioning is a little complicated, but I feel it should be flexible enough to support a variety of partitioning and distribution algorithms instead of the current implementation.
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Hyunsik Choi (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148187#comment-13148187 ] 

Hyunsik Choi commented on GIRAPH-11:
------------------------------------

That's a huge patch :)
I have just started to explore your patch.
I will leave some comments (maybe tomorrow).
                
> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-11.diff
>
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

Posted by "Severin Corsten (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101590#comment-13101590 ] 

Severin Corsten commented on GIRAPH-11:
---------------------------------------

What hashing algorithm do you want to use for the hash-partitioning? Just a hash of the vertexid or a more complex think or is it the users choice?

How do you want to solve the messaging between the vertices when using hash partitioning? Will you store a Map with Vertex -> Worker or provide the hash algorithem to workers, so that they can identify the destination worker by themself? 

> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted.  If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset.  This is often a bit inconvenient.
> Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach.  The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be unordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira