Posted to user@giraph.apache.org by Jan Ebbing <Ja...@gmx.de> on 2015/07/06 17:18:32 UTC

Questions after writing some giraph code

Hello everybody,

 

first of all thank you for all your work on giraph.

I'm a student writing his bachelor thesis using giraph.

I have already implemented an algorithm that isn't completely trivial. With
my new task, however, I'm running into problems.

 

I implemented my first algorithm as a subclass of MasterCompute where I
register aggregators etc and then do a switch on the superstep and call
setComputation() with the appropriate AbstractComputation subclass I wrote.
(I hope this is how you're supposed to do it)

Now I want to implement a new algorithm that calls my first algorithm as a
subroutine: While the halting condition is not fulfilled, I first partition
the graph using algorithm 1, then continue to work on it with algorithm 1.
Is there a way to do this without duplicating a lot of code of the
"standalone" MasterCompute class of algorithm 1? An additional problem that
arises with this is that the Vertex- and EdgeValue classes would have to be
exactly the same. I tried to work around this by defining a writable
interface that defines all the methods needed for my algorithms which would
make the actual classes exchangeable. However, giraph uses reflection to
create a new instance of exactly the type the computation uses (an
interface), which leads to an error. Is there any good way to do this or are
giraph jobs not that flexible? 
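
The reflection problem described above can be reproduced in plain Java. This is only a sketch, assuming the framework creates value objects through a no-argument constructor looked up by reflection (Giraph's actual code path may differ). PartitionedValue, SimpleValue and ReflectionDemo are hypothetical names, not Giraph classes.

```java
// Hypothetical interface shared by the vertex values of both algorithms.
interface PartitionedValue {
    int getPartition();
    void setPartition(int partition);
}

// Hypothetical concrete value with the no-argument constructor that reflection needs.
class SimpleValue implements PartitionedValue {
    private int partition = -1;

    SimpleValue() { }

    @Override public int getPartition() { return partition; }
    @Override public void setPartition(int partition) { this.partition = partition; }
}

class ReflectionDemo {
    // Returns a new instance, or null if the type cannot be instantiated reflectively.
    static <T> T tryInstantiate(Class<T> type) {
        try {
            // An interface has no constructors, so this lookup throws NoSuchMethodException.
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            return null;
        }
    }
}
```

With these definitions, tryInstantiate(PartitionedValue.class) yields null while tryInstantiate(SimpleValue.class) yields a usable object, which suggests that whatever class the job is configured with has to be a concrete one.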

 

In a variation of the first algorithm, I need to register n aggregators (n =
number of vertices). However, the documentation reads like I have to
register aggregators in MasterCompute.initialize() (nowhere else) and at
that point, getTotalNumVertices() does not work yet. I am aware that this is
a very costly operation (I do not use all of those aggregators but I do not
know beforehand which of them I will use and which I will not use), but
currently the only workaround I can think of is using
Configuration.getInt() (and writing the total number of vertices in such a
config file beforehand).

Also a question on the config files: I understood that there are several of
them that overwrite each other and that e.g. I should not change the
core-default.xml or core-site.xml since they will change my complete Hadoop
installation. I also read somewhere that it's possible to write a config
file just for one job (which would be what I need) but I never found out how
that file should be called/where I have to place it.

 

My biggest problem right now is debugging, though: Is there an easy way to
test giraph code on small sample graphs? Right now, to test my
implementations I have to package my code with maven, copy the long command
into the terminal to run the giraph job (changing the output folder since
they have to be different each time), wait a few minutes for the job to
complete, open the web GUI, and click through a few pages there until I see my
debug statements; if the job completed, I have to go through a text file in
the console. Compared to what I was used to (1 click in eclipse and almost
instantly seeing the output on the console) this is very annoying,
especially since I make dumb small mistakes like switching the if and else
blocks more often than I'd like to and have to go through the whole process
each time that happens.

I also searched for a solution and found GRAFT, which seemed to be a useful
debugging tool, but more suitable for testing on real input than for
quickly checking whether the code runs at all on a small input graph.

After searching through this mailing list archive, I found a few
references to running a giraph job with one click in eclipse as well (see
also [1]), but most descriptions were very vague and I could not reproduce
them.

 

Lastly, one small question: In my first algorithm I had a small bug where I
would use getVertexValue, then change the java object but not call
setVertexValue which resulted in my changes not being saved leading to
undesired behavior. After reading through another giraph algorithm, I
noticed that they do the same (maybe it was with an EdgeValue, I'm not 100%
sure on that) and don't call setXValue, but apparently their code works. Can
anybody shed some light on that? (I understand why it's useful to have an
explicit setVertexValue method for writing/reading vertices to/from disk, I
just don't understand why it is not necessary for them?)

 

Thanks,

Jan

 

 

[1]
http://ben-tech.blogspot.in/2011/08/how-to-debug-hadoop-mapreduce-jobs-in.html


Re: Questions after writing some giraph code

Posted by Sergey Edunov <ed...@gmail.com>.
Hi Jan,

2. Can you store information related to partitions in the vertices themselves?
You can add a #of_vertices field to each vertex and then, whenever it is not
0, you consider it a non-empty partition.
Even if you check the aggregators only once, just creating N aggregators will
not be efficient, and it also undermines the distributed nature of Giraph.

3. Test cases generally do the same thing that happens on the cluster. The only
difference is IO, which is hard to cover with unit tests. Make sure you
properly implement the Writable interface for VertexValue: you need to provide
code for serializing and de-serializing all object fields.

4. The reason it works for them is that VertexValue is mutable. They change
the value of the existing object and there is no need to set a new vertex value.
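
The difference described in point 4 can be seen in plain Java without any Giraph classes. The following is a rough sketch: VertexStub is a hypothetical stand-in for a vertex that merely holds a reference to its value, and PartitionValue is a hypothetical mutable value type. It does not model serialization or any copying the framework might do.

```java
// Hypothetical stand-in for a vertex that holds a reference to its value.
class VertexStub<V> {
    private V value;

    VertexStub(V value) { this.value = value; }

    V getValue() { return value; }
    void setValue(V value) { this.value = value; }
}

// Hypothetical mutable value type, similar in spirit to a custom vertex value class.
class PartitionValue {
    private int newPartition = -1;

    int getNewPartition() { return newPartition; }
    void setNewPartition(int newPartition) { this.newPartition = newPartition; }
}

class MutabilityDemo {
    // Mutable value: the setter changes the object the vertex already references.
    static int mutateInPlace() {
        VertexStub<PartitionValue> vertex = new VertexStub<>(new PartitionValue());
        vertex.getValue().setNewPartition(7);
        return vertex.getValue().getNewPartition();
    }

    // Immutable value: the addition creates a new Integer that only the local variable sees.
    static int rebindLocalOnly() {
        VertexStub<Integer> vertex = new VertexStub<>(1);
        Integer local = vertex.getValue();
        local = local + 1;
        return vertex.getValue();
    }

    // Immutable value, stored back explicitly.
    static int replaceAndSet() {
        VertexStub<Integer> vertex = new VertexStub<>(1);
        vertex.setValue(vertex.getValue() + 1);
        return vertex.getValue();
    }
}
```

Here mutateInPlace() returns 7, rebindLocalOnly() returns 1 and replaceAndSet() returns 2. A setter on a mutable value object falls into the first case, which would explain why no explicit set call is needed there.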

I'm not sure about configuration; we use our own runner, so it is
different. You can always pass individual options to the job by adding
-D{option.name}={option.value} to the hadoop call. See an example here:
https://apache.googlesource.com/giraph/+/2e8c2c694c98c4ac7c371a7b9dc0b28abba79ffd/src/site/xdoc/rexster.xml#114





Re: Questions after writing some giraph code

Posted by Jan Ebbing <Ja...@gmx.de>.
Hello,

 

thanks for the quick answer.

1.       I will look into changing my code to use static methods; I think this should be possible.

2.       Aggregators: Each aggregator stores the load (= #vertices) of a partition; in the very first step this is 1 for every partition. Then partitions will be merged together, leaving most at 0, so only k aggregators (for n >> k) are really used to track the state of the surviving partitions. I only check the state of all aggregators once per complete iteration of the algorithm (after 1 complete graph partitioning iteration).

3.       I wrote some unit tests, but there’s usually a larger difference than I would like between what runs locally and what works in giraph (might be my fault for writing too few or the wrong tests, though). I’m using giraph on a pseudo-distributed Ubuntu machine on very small graphs since I’m implementing the algorithms at the moment. I just had several problems with things that worked in eclipse but gave strange errors when I ran the whole algorithm in giraph (e.g. the VertexValue not remembering its partition even though the setPartition method was called, or an aggregator value that was 0 because I obtained it in an odd superstep, while it only held a value in even supersteps).

4.       Sure, in [1] at line 311 they call ComputeNewPartition.requestMigration() and the method ends there; requestMigration calls VertexValue.setNewPartition(), which is a basic setter, without calling setValue at any point.

 

Could you shed some light on where I can place a job-specific config file which holds information obtainable with MasterCompute.getContext().getConfiguration().getInt()/getDouble()/… ?

 

Regards,

Jan

 

[1] https://github.com/grafos-ml/okapi/blob/master/src/main/java/ml/grafos/okapi/spinner/Spinner.java#L311 

 


 


Re: Questions after writing some giraph code

Posted by Sergey Edunov <ed...@gmail.com>.
Hi Jan,

It's a bit hard to advise without seeing actual code, so my reply might
seem too generic. Feel free to send specific questions with code samples to
get more detailed advice.

"Is there a way to do this without duplicating a lot of code of the
“standalone” MasterCompute class of algorithm" - this sort of thing is
usually done by abstracting your algorithm out of giraph-related classes.
You can have a class with static methods that implement your algorithm and
then all you need to do is to pass data from vertices or master compute
into this class. This approach has other benefits such as easy testability.
E.g. you can write unit tests for the algorithm.
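
As a rough illustration of that separation, here is a sketch of an algorithm core that has no Giraph types in its signature. PartitionLogic and chooseLabel are hypothetical names, and the rule itself (take the most frequent neighbor label) is only a placeholder for whatever the real algorithm does per vertex.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical algorithm core with no Giraph types, callable from any computation class.
final class PartitionLogic {
    private PartitionLogic() { }

    // Picks the label that occurs most often among the neighbors.
    // Ties go to the smaller label; with no neighbors the current label is kept.
    static int chooseLabel(int currentLabel, List<Integer> neighborLabels) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int label : neighborLabels) {
            counts.merge(label, 1, Integer::sum);
        }
        int best = currentLabel;
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            int label = entry.getKey();
            int count = entry.getValue();
            if (count > bestCount || (count == bestCount && label < best)) {
                best = label;
                bestCount = count;
            }
        }
        return best;
    }
}
```

A computation class would then only collect the labels from its messages and call PartitionLogic.chooseLabel, and the same method can be exercised from an ordinary unit test without starting a job.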

"I need to register n aggregators (n = number of vertices)" - this is
generally a bad sign. How much data do you want to store in aggregators?
Remember, they will be sent over the wire between workers and master. You
can, of course, get around this by registering a single Map as an aggregator. You'll
need to wrap it in another class that implements Writable and then
implement the readFields and write functions.
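
A minimal sketch of such a wrapper, using only the JDK so that it runs on its own: WritableLike is a local stand-in that declares the same two methods as org.apache.hadoop.io.Writable, and LoadMap is a hypothetical name for a map from partition id to load. Registering it with Giraph would additionally need an aggregator class around it, which is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Local stand-in with the same two methods as org.apache.hadoop.io.Writable.
interface WritableLike {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical value type: partition id -> load, holding only partitions that were touched.
class LoadMap implements WritableLike {
    private final Map<Integer, Long> loads = new TreeMap<>();

    void add(int partition, long delta) {
        loads.merge(partition, delta, Long::sum);
    }

    // Combines another partial map into this one, summing loads per partition.
    void mergeFrom(LoadMap other) {
        for (Map.Entry<Integer, Long> entry : other.loads.entrySet()) {
            add(entry.getKey(), entry.getValue());
        }
    }

    long get(int partition) { return loads.getOrDefault(partition, 0L); }

    int size() { return loads.size(); }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(loads.size());
        for (Map.Entry<Integer, Long> entry : loads.entrySet()) {
            out.writeInt(entry.getKey());
            out.writeLong(entry.getValue());
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        loads.clear(); // the object may be reused, so drop any old state first
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            int partition = in.readInt();
            long load = in.readLong();
            loads.put(partition, load);
        }
    }

    // Serializes and deserializes a map in memory, as a cheap check of write/readFields.
    static LoadMap roundTrip(LoadMap original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            LoadMap copy = new LoadMap();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The map only grows with the number of partitions that actually carry load, and a round-trip check like roundTrip is also a cheap unit test for catching a field that was left out of write or readFields.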

"My biggest problem right now is debugging" - unit tests are usually very
good approach. Again, it depends a lot on your configuration. We also use
downsampled graphs a lot to quickly test on the cluster.

" After reading through another giraph algorithm, I noticed that they do
the same" - can you point to the example? I suspect that's because they use
mutable data types or some helper functions to change the value.

Regards,
Sergey Edunov

