You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@giraph.apache.org by Severin Andreas Corsten <se...@de.ibm.com> on 2011/09/08 20:07:32 UTC

Input And Partitioning

Hello everyone,

I tried writing my own code using Giraph and like it a lot but I have some 
questions regarding the input splinting and the partitioning of the graph.

1: Am I right in the assumption that Giraph does not split the input file 
by itself. Assume that I have got a graph in one single file, Giraph sends 
the whole graph to one worker while the rest of the workers is just idle. 

2: I read through the source code and found a part saying that vertices 
must be presented in id-order. Is that a task the user has to do or is 
there a workaround to have vertices not in id-order?

3: The VertexRange class provides the assignment between vertices and 
workers. Is there  a way to override the standard implementation and use a 
custom assignment system?

Thanks in advance.

Kind regards / Mit freundlichen Grüßen

Severin Andreas Corsten
DHBW-Student Business Informatics 2009 - University Programs
IBM Sales & Distribution, Human Resources
WI09N-M

Phone: 1-408-927-2750 
Mobile (Germany): 49-160-98976935 
E-mail: severin.corsten@de.ibm.com 


Hechtsheimer Str. 2
Mainz, 55131
Germany 


IBM Deutschland Management & Business Support GmbH / Vorsitzender des 
Aufsichtsrats: Martin Jetter
Geschäftsführung: Martina Koederitz (Vorsitzende), Reinhard Reschke, 
Dieter Scholz, Gregor Pillen, Joachim Heel, Christian Noll
Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, 
HRB 24938

Re: Input And Partitioning

Posted by Severin Andreas Corsten <se...@de.ibm.com>.
Thank you very much, you helped me a lot. 

Kind regards / Mit freundlichen Grüßen

Severin Andreas Corsten
DHBW-Student Business Informatics 2009 - University Programs
IBM Sales & Distribution, Human Resources
WI09N-M

Phone: 1-408-927-2750 
E-mail: severin.corsten@de.ibm.com 


Hechtsheimer Str. 2
Mainz, 55131
Germany 


IBM Deutschland Management & Business Support GmbH / Vorsitzender des 
Aufsichtsrats: Martin Jetter
Geschäftsführung: Martina Koederitz (Vorsitzende), Reinhard Reschke, 
Dieter Scholz, Gregor Pillen, Joachim Heel, Christian Noll
Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, 
HRB 24938



From:   Avery Ching <ac...@apache.org>
To:     giraph-user@incubator.apache.org
Cc:     Joseph Boyd <jo...@cbsinteractive.com>
Date:   08.09.2011 13:29
Subject:        Re: Input And Partitioning



Great answers and suggestion Joe.  I just wanted to inline a few 
comments to what Joe wrote.

On 9/8/11 11:27 AM, Joseph Boyd wrote:
> On Thu, Sep 8, 2011 at 11:07 AM, Severin Andreas Corsten
> <se...@de.ibm.com>  wrote:
>
>> 1: Am I right in the assumption that Giraph does not split the input 
file by itself. Assume that I have got a graph in one single file,
>> Giraph sends the whole graph to one worker while the rest of the 
workers is just idle.
> Giraph uses the number of InputSplits returned by your
> VertexInputFormat.getSplits() implementation.
>
> For VertextInputFormats that wrap Hadoop TextInputFormat, what you've
> said will be true, and graph input in one, small file, will all be
> sent to one worker.  As a cheap work-around for this, we've the
> FileInputFormat split size arbitrarily small :
>            FileInputFormat.setMaxInputSplitSize(bspJob, 1048576); //
> number of bytes in one meg
>
> Additionally, Giraph has re-balancing features that can give work to
> under-used workers in subsequent supersteps, but I haven't played with
> them.  (I'm not sure they would have helped me anyway, as my graph,
> even though it had a small input file, wouldn't fit into memory on one
> worker).
>
>
>> 2: I read through the source code and found a part saying that vertices 
must be presented in id-order. Is that a task the user has
>> to do or is there a workaround to have vertices not in id-order?
> Sorting the input into Id-order is for the user to do.  There are open
> JIRAs, like GIRAPH-11 [1] to improve the situation here.
>
>
>> 3: The VertexRange class provides the assignment between vertices and 
workers. Is there  a way to override the
>> standard implementation and use a custom assignment system?
> I have no idea, but the work in GIRAPH-11 will probably give a clue
> what's involved.
GIRAPH-11 will change a lot about the way vertices are assigned.  There 
will be an option for hashing, hash ranges, or user-defined ranges. 
There is also a way to control the assignment of vertex ranges to at 
some level right now (this will likely change a bit as well after 
GIRAPH-11).

In GiraphJob, there is a method

     /**
      * Set the vertex range balancer class (optional)
      *
      * @param vertexRangeBalancerClass Determines how vertex
      *        ranges are balanced prior to each superstep
      */
     final public void setVertexRangeBalancerClass(
             Class<?> vertexRangeBalancerClass) {
         getConfiguration().setClass(VERTEX_RANGE_BALANCER_CLASS,
                                     vertexRangeBalancerClass,
                                     VertexRangeBalancer.class);
     }

By default, we use the StaticBalancer, it doesn't move vertices at all. 
There is also an AutoBalancer that tries to balance the graph based on 
vertices or edges.  You can also write you own.  Hope that helps.
>
>
> ...joe
>
>
> [1]  https://issues.apache.org/jira/browse/GIRAPH-11
>
>
>
>
>
>> Thanks in advance.
>>
>> Kind regards / Mit freundlichen Grüßen
>>
>> Severin Andreas Corsten
>> DHBW-Student Business Informatics 2009 - University Programs
>> IBM Sales&  Distribution, Human Resources
>> WI09N-M
>> ________________________________
>> Phone: 1-408-927-2750
>> Mobile (Germany): 49-160-98976935
>> E-mail: severin.corsten@de.ibm.com
>>
>> Hechtsheimer Str. 2
>> Mainz, 55131
>> Germany



Re: Input And Partitioning

Posted by Avery Ching <ac...@apache.org>.
Great answers and suggestion Joe.  I just wanted to inline a few 
comments to what Joe wrote.

On 9/8/11 11:27 AM, Joseph Boyd wrote:
> On Thu, Sep 8, 2011 at 11:07 AM, Severin Andreas Corsten
> <se...@de.ibm.com>  wrote:
>
>> 1: Am I right in the assumption that Giraph does not split the input file by itself. Assume that I have got a graph in one single file,
>> Giraph sends the whole graph to one worker while the rest of the workers is just idle.
> Giraph uses the number of InputSplits returned by your
> VertexInputFormat.getSplits() implementation.
>
> For VertextInputFormats that wrap Hadoop TextInputFormat, what you've
> said will be true, and graph input in one, small file, will all be
> sent to one worker.  As a cheap work-around for this, we've the
> FileInputFormat split size arbitrarily small :
>            FileInputFormat.setMaxInputSplitSize(bspJob, 1048576); //
> number of bytes in one meg
>
> Additionally, Giraph has re-balancing features that can give work to
> under-used workers in subsequent supersteps, but I haven't played with
> them.  (I'm not sure they would have helped me anyway, as my graph,
> even though it had a small input file, wouldn't fit into memory on one
> worker).
>
>
>> 2: I read through the source code and found a part saying that vertices must be presented in id-order. Is that a task the user has
>> to do or is there a workaround to have vertices not in id-order?
> Sorting the input into Id-order is for the user to do.  There are open
> JIRAs, like GIRAPH-11 [1] to improve the situation here.
>
>
>> 3: The VertexRange class provides the assignment between vertices and workers. Is there  a way to override the
>> standard implementation and use a custom assignment system?
> I have no idea, but the work in GIRAPH-11 will probably give a clue
> what's involved.
GIRAPH-11 will change a lot about the way vertices are assigned.  There 
will be an option for hashing, hash ranges, or user-defined ranges.  
There is also a way to control the assignment of vertex ranges to at 
some level right now (this will likely change a bit as well after 
GIRAPH-11).

In GiraphJob, there is a method

     /**
      * Set the vertex range balancer class (optional)
      *
      * @param vertexRangeBalancerClass Determines how vertex
      *        ranges are balanced prior to each superstep
      */
     final public void setVertexRangeBalancerClass(
             Class<?> vertexRangeBalancerClass) {
         getConfiguration().setClass(VERTEX_RANGE_BALANCER_CLASS,
                                     vertexRangeBalancerClass,
                                     VertexRangeBalancer.class);
     }

By default, we use the StaticBalancer, it doesn't move vertices at all.  
There is also an AutoBalancer that tries to balance the graph based on 
vertices or edges.  You can also write you own.  Hope that helps.
>
>
> ...joe
>
>
> [1]  https://issues.apache.org/jira/browse/GIRAPH-11
>
>
>
>
>
>> Thanks in advance.
>>
>> Kind regards / Mit freundlichen Grüßen
>>
>> Severin Andreas Corsten
>> DHBW-Student Business Informatics 2009 - University Programs
>> IBM Sales&  Distribution, Human Resources
>> WI09N-M
>> ________________________________
>> Phone: 1-408-927-2750
>> Mobile (Germany): 49-160-98976935
>> E-mail: severin.corsten@de.ibm.com
>>
>> Hechtsheimer Str. 2
>> Mainz, 55131
>> Germany


Re: Input And Partitioning

Posted by Joseph Boyd <jo...@cbsinteractive.com>.
On Thu, Sep 8, 2011 at 11:07 AM, Severin Andreas Corsten
<se...@de.ibm.com> wrote:

> 1: Am I right in the assumption that Giraph does not split the input file by itself. Assume that I have got a graph in one single file,
> Giraph sends the whole graph to one worker while the rest of the workers is just idle.

Giraph uses the number of InputSplits returned by your
VertexInputFormat.getSplits() implementation.

For VertextInputFormats that wrap Hadoop TextInputFormat, what you've
said will be true, and graph input in one, small file, will all be
sent to one worker.  As a cheap work-around for this, we've the
FileInputFormat split size arbitrarily small :
          FileInputFormat.setMaxInputSplitSize(bspJob, 1048576); //
number of bytes in one meg

Additionally, Giraph has re-balancing features that can give work to
under-used workers in subsequent supersteps, but I haven't played with
them.  (I'm not sure they would have helped me anyway, as my graph,
even though it had a small input file, wouldn't fit into memory on one
worker).


> 2: I read through the source code and found a part saying that vertices must be presented in id-order. Is that a task the user has
> to do or is there a workaround to have vertices not in id-order?

Sorting the input into Id-order is for the user to do.  There are open
JIRAs, like GIRAPH-11 [1] to improve the situation here.


> 3: The VertexRange class provides the assignment between vertices and workers. Is there  a way to override the
> standard implementation and use a custom assignment system?

I have no idea, but the work in GIRAPH-11 will probably give a clue
what's involved.



...joe


[1]  https://issues.apache.org/jira/browse/GIRAPH-11





> Thanks in advance.
>
> Kind regards / Mit freundlichen Grüßen
>
> Severin Andreas Corsten
> DHBW-Student Business Informatics 2009 - University Programs
> IBM Sales & Distribution, Human Resources
> WI09N-M
> ________________________________
> Phone: 1-408-927-2750
> Mobile (Germany): 49-160-98976935
> E-mail: severin.corsten@de.ibm.com
>
> Hechtsheimer Str. 2
> Mainz, 55131
> Germany