Posted to user@giraph.apache.org by Paolo Castagna <ca...@googlemail.com> on 2012/05/17 16:23:48 UTC

SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Hi,
I noticed that the SimplePageRankVertex implementation does not deal with
dangling nodes (dangling nodes are those nodes with no outgoing links).
And, probably that is one of the good reasons why there is a "Simple" in front. ;-)

 "Dangling nodes exist in many forms. For example, a page of data, a page with a
postscript graph, [...], a page that has been fetched by a crawler but not yet
explored - these are all examples of possible dangling nodes. The more ambitious
the crawl, the bigger the proportion of dangling nodes because the set of
fetched but uncrawled pages grows quickly." (p. 80) [1]

Wikipedia's PageRank page says:

 "If a page has no links to other pages, it becomes a sink and therefore
terminates the random surfing process. If the random surfer arrives at a sink
page, it picks another URL at random and continues surfing again. When
calculating PageRank, pages with no outbound links are assumed to link out to
all other pages in the collection. Their PageRank scores are therefore divided
evenly among all other pages. In other words, to be fair with pages that are not
sinks, these random transitions are added to all nodes in the Web, with a
residual probability usually set to d = 0.85, estimated from the frequency that
an average surfer uses his or her browser's bookmark feature." [2]

Now, if I want to account for the dangling nodes, I need to sum the
PageRank values of all dangling nodes and divide the total evenly among all
other pages (which in practice means sending a message to all vertexes in
Giraph... which is not possible, as it would be crazy with large graphs).

I imagine one can use a SumAggregator and alternate between computing the
contribution from dangling nodes and PageRank values (i.e. PageRank values
are computed only every 2 supersteps).
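
To make the scheme concrete, the update rule would be (my notation: damping
factor d = 0.85, N vertices, D the aggregated PageRank mass of the dangling
nodes from the previous superstep):

  r(v) = (1 - d) / N + d * ( sum over u->v of r(u) / outdegree(u) + D / N )

  where D = sum of r(u) over all dangling nodes u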

What do you think?

Paolo

 [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The
Science of Search Engine Rankings" (2006)
 [2] http://en.wikipedia.org/wiki/PageRank

Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi Sebastian

Sebastian Schelter wrote:
> Accumulation should be possible with aggregators, and the idea of
> applying preference vectors seems very interesting too.

Indeed.

But...

 - How big is the preference vector? (it's N elements, where N is the number of vertexes in your graph, right?)
 - How do we load that in (the Aggregator)? Can we do it just once at the beginning?
 - How can a user determine the preference vector? (this is not something we need to be too worried about, it's a user-land problem...)
 - ...

So, I think dealing properly with dangling nodes is a much more important first step.

> Maybe these should be implemented as additional features in GIRAPH-191?

Yeah, it would be good to have the capabilities in a RandomWalkVertex (as you showed) and maybe use that in a 'proper' PageRankVertex implementation (one that is more configurable, deals with dangling
nodes and/or iterates until convergence is reached...)

Paolo

> 
> On 18.05.2012 11:24, Gianmarco De Francisci Morales wrote:
>> Could you just accumulate the PR of dangling nodes in a variable and divide
>> it equally in the next iteration?
>> If you use a preference vector instead of dividing it equally, then you
>> have also the strongly preferential (and weakly preferential in case you
>> divide equally) versions of RWR.
>>
>> Cheers,
>> --
>> Gianmarco
>>
>>
>>
>>
>> On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna <
>> castagna.lists@googlemail.com> wrote:
>>
>>> Sebastian Schelter wrote:
>>>> You are right, the code would throw an exception here, but I guess it
>>>> would be sufficient to simply add a check here and not send any messages.
>>> That would avoid throwing an exception, but I am not sure it is actually
>>> the best thing to do.
>>> Some suggest dividing the PageRank scores of dangling nodes evenly among
>>> all other pages.
>>> If you want, this is equivalent to another random jump, but forced by the
>>> fact that there are no outgoing links to select from.
>>> Using your notation below:
>>>
>>>  0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks +
>>> sumOfDanglingNodesRanks / getNumVertices() )
>>>
>>> See also p. 38 of the book "Google's PageRank and Beyond":
>>> http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38
>>>
>>> In other words, in pseudocode, a serial implementation which accounts for
>>> dangling nodes would be:
>>>
>>> --------
>>> dangling_nodes_rank = 0
>>> foreach node in dangling_nodes
>>>    dangling_nodes_rank += ranks.get ( node )
>>> dangling_nodes_rank = ( damping_factor * dangling_nodes_rank ) / N
>>>
>>> foreach node1 in graph
>>>    r = 0
>>>    foreach node2 in incoming_links ( node1 )
>>>        r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
>>>    new_ranks.set ( node1,
>>>        damping_factor * r + dangling_nodes_rank + ( 1 - damping_factor ) / N )
>>> --------
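
As a minimal runnable Java version of that serial pseudocode (my own sketch,
not from the thread: it assumes the graph is an adjacency map and that every
node id appears as a key, i.e. the "valid input" case discussed later in this
thread):

--------
import java.util.*;

public class DanglingPageRank {

  // One damped iteration per pass: the dangling mass is summed up front and
  // spread evenly over all nodes, exactly as in the pseudocode above.
  static Map<Integer, Double> pagerank(Map<Integer, List<Integer>> graph,
                                       double d, int iterations) {
    int n = graph.size();
    Map<Integer, Double> ranks = new HashMap<>();
    for (Integer node : graph.keySet()) {
      ranks.put(node, 1.0 / n);
    }
    for (int i = 0; i < iterations; i++) {
      // Sum the rank of the dangling nodes (no outgoing links).
      double danglingRank = 0;
      for (Integer node : graph.keySet()) {
        if (graph.get(node).isEmpty()) {
          danglingRank += ranks.get(node);
        }
      }
      danglingRank = d * danglingRank / n;
      // Base value: random jump plus the evenly divided dangling mass.
      Map<Integer, Double> newRanks = new HashMap<>();
      for (Integer node : graph.keySet()) {
        newRanks.put(node, (1 - d) / n + danglingRank);
      }
      // Every node spreads its damped rank over its outgoing links.
      for (Integer node : graph.keySet()) {
        List<Integer> out = graph.get(node);
        for (Integer target : out) {
          newRanks.merge(target, d * ranks.get(node) / out.size(), Double::sum);
        }
      }
      ranks = newRanks;
    }
    return ranks; // sums to ~1.0: still a probability distribution
  }
}
--------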
>>>
>>> What do you think?
>>>
>>> Paolo
>>>
>>>> --sebastian
>>>>
>>>>
>>>> On 17.05.2012 23:44, Paolo Castagna wrote:
>>>>> Hi Sebastian,
>>>>> from SimplePageRankVertex.java, we have:
>>>>>
>>>>>     if (getSuperstep() < MAX_SUPERSTEPS) {
>>>>>       long edges = getNumOutEdges();
>>>>>       sendMsgToAllEdges(
>>>>>           new DoubleWritable(getVertexValue().get() / edges));
>>>>>     } else {
>>>>>       voteToHalt();
>>>>>     }
>>>>>
>>>>> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?
>>>>>
>>>>> Paolo
>>>>>
>>>>> Sebastian Schelter wrote:
>>>>>> The implementation implicitly deals with dangling nodes by computing
>>> the
>>>>>> pageRank as
>>>>>>
>>>>>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>>>>>>
>>>>>> Conceptually, this is the same as connecting every vertex with every
>>>>>> other vertex it is not yet connected to. This removes all dangling
>>> nodes
>>>>>> from the graph as every vertex can reach every other vertex.
>>>>>>
>>>>>> "Mining of Massive Datasets" [1] has a very well written explanation of
>>>>>> this approach.
>>>>>>
>>>>>> --sebastian
>>>>>>
>>>>>>
>>>>>> [1] http://infolab.stanford.edu/~ullman/mmds.html
>>>>>>
>>>>>> On 17.05.2012 16:23, Paolo Castagna wrote:
>>>>>>> [...]
> 

Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Sebastian Schelter <ss...@apache.org>.
Accumulation should be possible with aggregators, and the idea of
applying preference vectors seems very interesting too.

Maybe these should be implemented as additional features in GIRAPH-191?

On 18.05.2012 11:24, Gianmarco De Francisci Morales wrote:
> [...]


Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Paolo Castagna <ca...@googlemail.com>.
Sebastian Schelter wrote:
> However, the problem with this input is that the dangling vertices that
> don't have a line of their own (such as 11) cannot contribute their
> accumulated rank, as no vertex for them will be instantiated. So
> counting them doesn't help either.

No, the 'implicit' dangling nodes (such as 6, 7, 9 and 11 below) are
instantiated when you send a message to them. If you run the example,
you'll see that after the first superstep there are 11 vertices which
are sending and receiving messages (as it should be with correct input).

> I think that we should rely on users supplying valid input (a line for
> each vertex) and not try to correct for that in the vertex class.

Well, I don't disagree in principle.

But in practice this won't stop users from making mistakes and providing your
software with bad data as input. :-) One superstep for cleaning/validating
the input data isn't that bad after all.

> Creating a line for each vertex from such a file is an easy task that is
> doable with a single MapReduce pass over the data beforehand.

Sure. (Why is this better than a superstep with Giraph?) ;-)

Paolo



Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Paolo,

I see what you mean.

However, the problem with this input is that the dangling vertices that
don't have a line of their own (such as 11) cannot contribute their
accumulated rank, as no vertex for them will be instantiated. So
counting them doesn't help either.

I think that we should rely on users supplying valid input (a line for
each vertex) and not try to correct for that in the vertex class.

Creating a line for each vertex from such a file is an easy task that is
doable with a single MapReduce pass over the data beforehand.
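
A minimal sketch of that pass (my own, using the standard Hadoop MapReduce
API; the class names are made up): the mapper emits every id it sees, so the
reducer writes one adjacency line per vertex, which gives implicit dangling
nodes like 11 an empty line of their own:

--------
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaterializeVertices {

  // "1 1 2 2" -> (1, "1 2 2"), plus (1, ""), (2, ""), (2, "") for the targets.
  public static class AdjacencyMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] ids = line.toString().trim().split("\\s+");
      ctx.write(new Text(ids[0]),
          new Text(line.toString().trim().substring(ids[0].length()).trim()));
      for (int i = 1; i < ids.length; i++) {
        ctx.write(new Text(ids[i]), new Text("")); // ensure the target exists
      }
    }
  }

  // Concatenates the adjacency fragments per id; vertices seen only as
  // targets come out with an empty list, i.e. as explicit dangling nodes.
  public static class AdjacencyReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text id, Iterable<Text> fragments, Context ctx)
        throws IOException, InterruptedException {
      StringBuilder out = new StringBuilder();
      for (Text fragment : fragments) {
        if (fragment.getLength() > 0) {
          if (out.length() > 0) {
            out.append(' ');
          }
          out.append(fragment);
        }
      }
      ctx.write(id, new Text(out.toString()));
    }
  }
}
--------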

--sebastian


On 28.05.2012 22:13, Paolo Castagna wrote:
> [...]


Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi Sebastian,
you can try it yourself with some simple input (which contains dangling nodes).

For example, say I have this adjacency list (with mistakes/repetitions and
self-links, ignore those):

1 1 2 2
2 3 5 7 9 11
3 3 3 6
4
5 1 2 3 11
8 10
10 5

How many vertices? 7 or 11?

I think this graph has 11, and it's 11 you need to use as the number of
vertices when you compute PageRank.
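
A quick way to check (my own snippet, not from the thread): count every id
that occurs either as a source or as a target of a link:

--------
import java.util.*;

// Distinct vertex ids in the adjacency list above, sources and targets alike.
public class CountVertices {
  public static void main(String[] args) {
    String[] lines = { "1 1 2 2", "2 3 5 7 9 11", "3 3 3 6", "4",
                       "5 1 2 3 11", "8 10", "10 5" };
    Set<String> ids = new HashSet<>();
    for (String line : lines) {
      ids.addAll(Arrays.asList(line.trim().split("\\s+")));
    }
    System.out.println(ids.size()); // prints 11
  }
}
--------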

Paolo

Sebastian Schelter wrote:
> Hi Paolo,
> 
> Why would getNumVertices() not give back the correct number of vertices?
> This call should always give back the overall number of vertices (if it
> doesn't we have to fix it) and you shouldn't have to rely on tricks to
> count stuff via aggregators.
> 
> --sebastian



Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Paolo,

Why would getNumVertices() not give back the correct number of vertices?
This call should always give back the overall number of vertices (if it
doesn't we have to fix it) and you shouldn't have to rely on tricks to
count stuff via aggregators.

--sebastian



On 28.05.2012 21:20, Paolo Castagna wrote:
> [...]


Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi Sebastian

Sebastian Schelter wrote:
> I think the code can be improved partially. You don't have to count the
> vertices via an aggregator, you can simply use getNumVertices().

No, you cannot use getNumVertices() since it won't count dangling nodes.
If you want to account for dangling nodes you need to send a message to all
nodes.

> Why do you only recompute the pageRank in each second superstep? Can we
> not use the aggregated value of the dangling nodes from the last superstep?

Maybe, but I have not managed to do it.

The dangling node contribution needs to be recomputed each time, so the
aggregator needs to be reset to 0 each time and kept unchanged during each
superstep. The value is the sum of the PageRank values of all dangling nodes
in the previous superstep.

The good thing is that now there is a 'correct' implementation and a test
we can use to verify that PageRank values are correct (against a third
party implementation: JUNG).

> Overall I think we're on a good way to a robust, real-world PageRank
> implementation, I managed to implement the convergence check with an
> aggregator, will post an updated patch soon.

Thanks, that was my next step. :-)
+ making the damping factor configurable, and general clean-up.

Paolo

> 
> Best,
> Sebastian
> 
> 
> On 28.05.2012 18:39, Paolo Castagna wrote:
>> [...]

Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Avery Ching <ac...@apache.org>.
We did have a related issue 
(https://issues.apache.org/jira/browse/GIRAPH-155).

On 5/29/12 6:54 AM, Claudio Martella wrote:
> I'm not sure they will be needed to send them on the first superstep.
> They'll be created and used in the second superstep if necessary. If
> they need it in the first superstep, then i guess they'll put them as
> a line in the inputfile.
> I agree with you that this is kind of messed up :)
>
>
> On Tue, May 29, 2012 at 3:23 PM, Sebastian Schelter<ss...@apache.org>  wrote:
>> [...]


Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Claudio Martella <cl...@gmail.com>.
I'm not sure messages will need to be sent to them in the first superstep.
They'll be created and used in the second superstep if necessary. If
users need them in the first superstep, then I guess they'll put them as
a line in the input file.
I agree with you that this is kind of messed up :)


On Tue, May 29, 2012 at 3:23 PM, Sebastian Schelter <ss...@apache.org> wrote:
> Oh sorry, I didn't know about that discussion. The problem I see is that in
> every implementation, a user might run into this issue, and I don't
> think it's ideal to force users to always run a round of sending empty
> messages at the beginning.
>
> Maybe the system should (somehow) automagically do that for the users?
> Really seems to be an awkward situation though...
>
> --sebastian
>
>
>
> On 29.05.2012 15:03, Claudio Martella wrote:
>> [...]



-- 
   Claudio Martella
   claudio.martella@gmail.com

Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Sebastian Schelter <ss...@apache.org>.
Oh sorry, I didn't know about that discussion. The problem I see is that in
every implementation, a user might run into this issue, and I don't
think it's ideal to force users to always run a round of sending empty
messages at the beginning.

Maybe the system should (somehow) automagically do that for the users?
Really seems to be an awkward situation though...

--sebastian



On 29.05.2012 15:03, Claudio Martella wrote:
> About the mapreduce job to prepare the input set, I did advocate for
> this solution instead of supporting automatic creation of non-existent
> vertices implicitly (which I believe adds a logical path in vertex
> resolution which has some drawbacks, e.g. you have to check in the
> hashmap for the existence of the destination vertex for each message,
> which is "fine" now that it's a hashmap, but it's going to be less
> fine when/if we turn to TreeMap for out-of-core).
> 
> Unfortunately the other committers preferred going for the path that
> helps userland's life, so I guess this solution is not to be
> considered here either.
> 
> On Tue, May 29, 2012 at 1:48 PM, Sebastian Schelter <ss...@apache.org> wrote:
>> [...]


Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Claudio Martella <cl...@gmail.com>.
About the mapreduce job to prepare the input set, I did advocate for
this solution instead of supporting automatic creation of non-existent
vertices implicitly (which I believe adds a logical path in vertex
resolution which has some drawbacks, e.g. you have to check in the
hashmap for the existence of the destination vertex for each message,
which is "fine" now that it's a hashmap, but it's going to be less
fine when/if we turn to TreeMap for out-of-core).

Unfortunately the other committers preferred going for the path that
helps userland's life, so I guess this solution is not to be
considered here either.

On Tue, May 29, 2012 at 1:48 PM, Sebastian Schelter <ss...@apache.org> wrote:
> [...]



-- 
   Claudio Martella
   claudio.martella@gmail.com

Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Sebastian Schelter <ss...@apache.org>.
On 29.05.2012 13:13, Paolo Castagna wrote:
> Hi Sebastian
> 
> Sebastian Schelter wrote:
>> Why do you only recompute the pageRank in each second superstep? Can we
>> not use the aggregated value of the dangling nodes from the last superstep?
> 
> I removed the recomputation of PageRank values every second superstep.
> However, I needed to use a couple of aggregators for the dangling nodes
> contribution instead of just one: "dangling-current" and "dangling-previous".
> 
> Each superstep I need to reset the dangling-current aggregator and, at the
> same time, I need to know the value of the aggregator from the previous
> superstep.

You can save the value from the previous step in a static variable in
the WorkerContext before resetting the aggregator.
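
As a framework-free sketch of that trick (my own illustration; real code would
put this in a Giraph WorkerContext subclass and go through the aggregator API
of whatever Giraph version is in use):

--------
// Before each superstep, the worker copies the aggregated dangling mass into
// a static field and resets the accumulator, so vertices can read the value
// of superstep t-1 while the mass of superstep t is being re-accumulated.
final class DanglingMass {
  static volatile double previous = 0d;    // read by vertices in compute()
  private static double accumulating = 0d; // written by dangling vertices

  static synchronized void aggregate(double value) {
    accumulating += value;
  }

  // Called once per worker, e.g. from preSuperstep().
  static synchronized void rollOver() {
    previous = accumulating;
    accumulating = 0d;
  }
}
--------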

> 
> I hope it makes sense, let me know if you have a better idea.
> 
>> Overall I think we're on a good way to a robust, real-world PageRank
>> implementation, I managed to implement the convergence check with an
>> aggregator, will post an updated patch soon.
> 
> I think I've just done it, have a look [1] and let me know if you would have
> done it differently.
> 
> Paolo
> 
>  [1]
> https://github.com/castagna/jena-grande/blob/11f07dd897562f7a4bf8d6e4845128d7f2cdd2ff/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankVertex.java#L90
> 
> 


Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi Sebastian

Sebastian Schelter wrote:
> Why do you only recompute the pageRank in each second superstep? Can we
> not use the aggregated value of the dangling nodes from the last superstep?

I removed the recomputation of PageRank values every second superstep.
However, I needed to use a couple of aggregators for the dangling nodes
contribution instead of just one: "dangling-current" and "dangling-previous".

Each superstep I need to reset the dangling-current aggregator and, at the
same time, I need to know the value of the aggregator from the previous
superstep.

I hope it makes sense, let me know if you have a better idea.

> Overall I think we're on a good way to a robust, real-world PageRank
> implementation, I managed to implement the convergence check with an
> aggregator, will post an updated patch soon.

I think I've just done it, have a look [1] and let me know if you would have
done it differently.

Paolo

 [1]
https://github.com/castagna/jena-grande/blob/11f07dd897562f7a4bf8d6e4845128d7f2cdd2ff/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankVertex.java#L90



Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Sebastian Schelter <ss...@apache.org>.
I think the code can be improved partially. You don't have to count the
vertices via an aggregator, you can simply use getNumVertices().

Why do you only recompute the pageRank in each second superstep? Can we
not use the aggregated value of the dangling nodes from the last superstep?

Overall I think we're on a good way to a robust, real-world PageRank
implementation, I managed to implement the convergence check with an
aggregator, will post an updated patch soon.

Best,
Sebastian


On 28.05.2012 18:39, Paolo Castagna wrote:
> [...]


Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi Sebastian

Sebastian Schelter wrote:
> Could we try to merge this with the patch from
> https://issues.apache.org/jira/browse/GIRAPH-191 ?

I'll look at doing that, but so far I have not managed to get
a correct implementation. Now that I have an implementation
which I know to be correct, I'll try using RandomWalkVertex.

Paolo

Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Paolo,

Could we try to merge this with the patch from
https://issues.apache.org/jira/browse/GIRAPH-191 ?

Best,
Sebastian

On 28.05.2012 18:39, Paolo Castagna wrote:
> [...]


Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Paolo Castagna <ca...@googlemail.com>.
Paolo Castagna wrote:
> Sebastian Schelter wrote:
>> I guess that summing up and redistributing the pagerank of dangling
>> vertices can also be done without an extra superstep in an aggregator.
> 
> Yeah! Why didn't I think about that?
> Thanks, great suggestion.
> 
> I am going to give this a go, at first without extending RandomWalkVertex since I want to see how it might work.
> This would also inform design and improvements of the current RandomWalkVertex.

You can find a 'proper' ;-) implementation of PageRank with dangling node
support, where the sum of all PageRank values at the end equals 1.00 (since it's a
probability distribution) and the PageRank values are validated against a third party
implementation (i.e. JUNG), here [1].
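
The sanity check behind that claim, as a small sketch (mine, not taken from
the linked code): with the dangling mass redistributed, PageRank stays a
probability distribution, so the values must sum to ~1.0:

--------
import java.util.Map;

// Fails fast if the computed ranks leaked or gained probability mass.
public class RankChecks {
  static void assertSumsToOne(Map<Integer, Double> ranks) {
    double sum = 0;
    for (double r : ranks.values()) {
      sum += r;
    }
    if (Math.abs(sum - 1.0) > 1e-9) {
      throw new AssertionError("PageRank values sum to " + sum);
    }
  }
}
--------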

I have not yet managed to do it as Sebastian suggested.

Paolo

 [1]
https://github.com/castagna/jena-grande/blob/3c2d9f85bacb737acd575e3e287dc0fcc6bd96b9/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankVertex.java


Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi Sebastian

Sebastian Schelter wrote:
> Very good points, could you add them to GIRAPH-191?

Done. Summarized with a link to this thread.

> I think that the convergence check does not need an extra superstep. It
> is sufficient to compute the L1-norm of the current pageRank vector,
> which can be done by an aggregator and compare it to the previous
> aggregated value.

Right. Much better. :-)

> I guess that summing up and redistributing the pagerank of dangling
> vertices can also be done without an extra superstep in an aggregator.

Yeah! Why didn't I think about that?
Thanks, great suggestion.

I am going to give this a go, at first without extending RandomWalkVertex since I want to see how it might work.
This would also inform design and improvements of the current RandomWalkVertex.

Thanks for your comments and suggestions.

Paolo

> 
> --sebastian
> 
> On 18.05.2012 12:31, Paolo Castagna wrote:
>> Hi Gianmarco
>>
>> Gianmarco De Francisci Morales wrote:
>>> Could you just accumulate the PR of dangling nodes in a variable and
>>> divide it equally in the next iteration?
>> Yep, this is exactly what I was thinking when I wrote: "I imagine one can use a SumAggregator and alternate between computing the contribution from dangling nodes and PageRank values (i.e. PageRank
>> values are computed only every 2 supersteps)."
>>
>> You need to use an Aggregator and spend a superstep just to aggregate/sum the PageRank of all dangling nodes.
>> Then, in the successive superstep you can divide that evenly among all other nodes (i.e. the sumOfDanglingNodesRanks / getNumVertices() below).
>>
>> Another change/enhancement could be to check for convergence (to do that, you need to keep current and previous PageRank values) and spend another superstep just doing that.
>> But, as you can see, things stop being simple and the point of SimplePageRankVertex is to offer a very simple example.
>> I did not start this thread with the aim of changing SimplePageRankVertex; I think it should stay as it is (and as Google's Pregel paper describes it).
>>
>> However, in practice, if you want to run PageRank or similar, you need to deal with problems such as: what to do with dangling nodes/pages? how to know how many iterations you need to reach the
>> precision you want/need? Etc.
>>
>> My initial message was to exchange some ideas and get some feedback and/or suggestions on how a not too simple PageRankVertex might look. ;-)
>>
>>> If you use a preference vector instead of dividing it equally, then you
>>> have also the strongly preferential (and weakly preferential in case you
>>> divide equally) versions of RWR.
>> Yep, but how do you get your preference vector? ;-)
>>
>> As a first step, it would be great if we could come up with a PageRankVertex which takes into account dangling nodes and perhaps/optionally checks for convergence every few supersteps.
>>
>> Anyway, thanks for your comments and feedback Gianmarco.
>>
>> Paolo
>>
>>> [...]
> 

Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Paolo,

Very good points, could you add them to GIRAPH-191?

I think that the convergence check does not need an extra superstep. It
is sufficient to compute the L1-norm of the current pageRank vector,
which can be done by an aggregator and compare it to the previous
aggregated value.
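
One way to realize such a check, as a serial sketch (mine; it aggregates the
per-vertex change of the rank vector, and in Giraph the loop body would be
each vertex's contribution to a sum aggregator):

--------
// Converged when the L1 distance between two successive PageRank vectors
// drops below epsilon; then stop iterating instead of running MAX_SUPERSTEPS.
static boolean converged(double[] previous, double[] current, double epsilon) {
  double l1 = 0;
  for (int i = 0; i < current.length; i++) {
    l1 += Math.abs(current[i] - previous[i]);
  }
  return l1 < epsilon;
}
--------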

I guess that summing up and redistributing the pagerank of dangling
vertices can also be done without an extra superstep in an aggregator.

--sebastian

On 18.05.2012 12:31, Paolo Castagna wrote:
> Hi Gianmarco
> 
> Gianmarco De Francisci Morales wrote:
>> Could you just accumulate the PR of dangling nodes in a variable and
>> divide it equally in the next iteration?
> 
> Yep, this is exactly what I was thinking when I wrote: "I imagine one can use a SumAggregator and alternate between computing the contribution from dangling nodes and PageRank values (i.e. PageRank
> values are computed only every 2 supersteps)."
> 
> You need to use an Aggregator and spend a superstep just to aggregate/sum the PageRank of all dangling nodes.
> Then, in the successive superstep you can divide that evenly among all other nodes (i.e. the sumOfDanglingNodesRanks / getNumVertices() below).
> 
> Another change/enhancement could be to check for convergence (to do that, you need to keep current and previous PageRank values) and spend another superstep just doing that.
> But, as you can see, things stop being simple and the point of SimplePageRankVertex is to offer a very simple example.
> I did not start this thread with the aim of changing SimplePageRankVertex; I think it should stay as it is (and as Google's Pregel paper describes it).
> 
> However, in practice, if you want to run PageRank or similar, you need to deal with problems such as: what to do with dangling nodes/pages? how to know how many iterations you need to reach the
> precision you want/need? Etc.
> 
> My initial message was to exchange some ideas and get some feedback and/or suggestions on how a not too simple PageRankVertex might look. ;-)
> 
>> If you use a preference vector instead of dividing it equally, then you
>> have also the strongly preferential (and weakly preferential in case you
>> divide equally) versions of RWR.
> 
> Yep, but how do you get your preference vector? ;-)
> 
> As a first step, it would be great if we could come up with a PageRankVertex which takes into account dangling nodes and perhaps/optionally checks for convergence every few supersteps.
> 
> Anyway, thanks for your comments and feedback Gianmarco.
> 
> Paolo
> 
>>
>> Cheers,
>> --
>> Gianmarco
>>
>>
>>
>>
>> On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna
>> <castagna.lists@googlemail.com <ma...@googlemail.com>>
>> wrote:
>>
>>     Sebastian Schelter wrote:
>>     > You are right, the code would throw an exception here, but I guess it
>>     > would be sufficient to simply add a check here and not send any
>>     messages.
>>
>>     That would avoid throwing an exception, but I am not sure it is
>>     actually the best thing to do.
>>     Some suggest to divide the PageRank scores of dangling nodes evenly
>>     among all other pages.
>>     If you want, this is equivalent to another random jump, but forced
>>     by the fact that there are no outgoing links to select from.
>>     Using your notation below:
>>
>>      0.15f / getNumVertices() + 0.85f ( sumOfNeighborRanks +
>>     sumOfDanlingNodesRanks / getNumVertices() )
>>
>>     See also pag. 38 of the book ""Google's PageRank and Beyond":
>>     http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38
>>     <http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38>
>>
>>     In other works, in pseudo code a serial implementation which
>>     accounts for dangling nodes should be:
>>
>>     --------
>>     dangling_nodes_ranks = 0
>>     foreach node in dangling_node
>>        dangling_nodes_ranks += ranks.get ( node )
>>     dangling_nodes_ranks = ( dumping_factor * dangling_nodes_ranks ) / N ;
>>
>>     foeach node1 in graph
>>        r = 0
>>        foreach node2 in incoming_links ( node1 )
>>            r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
>>
>>     r = dumping_factor * r + dangling_nodes_ranks + ( 1 - dumping_factor
>>     ) / N
>>     --------
>>
>>     What do you think?
>>
>>     Paolo
>>
>>     >
>>     > --sebastian
>>     >
>>     >
>>     > On 17.05.2012 23:44, Paolo Castagna wrote:
>>     >> Hi Sebastian,
>>     >> from SimplePageRankVertex.java, we have:
>>     >>
>>     >>     if (getSuperstep() < MAX_SUPERSTEPS) {
>>     >>       long edges = getNumOutEdges();
>>     >>       sendMsgToAllEdges(
>>     >>           new DoubleWritable(getVertexValue().get() / edges));
>>     >>     } else {
>>     >>       voteToHalt();
>>     >>     }
>>     >>
>>     >> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling
>>     node)?
>>     >>
>>     >> Paolo
>>     >>
>>     >> Sebastian Schelter wrote:
>>     >>> The implementation implicitly deals with dangling nodes by
>>     computing the
>>     >>> pageRank as
>>     >>>
>>     >>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>>     >>>
>>     >>> Conceptually, this is the same as connecting every vertex with every
>>     >>> other vertex it is not yet connected to. This removes all
>>     dangling nodes
>>     >>> from the graph as every vertex can reach every other vertex.
>>     >>>
>>     >>> "Mining of Massive Datasets" [1] has a very well written
>>     explanation of
>>     >>> this approach.
>>     >>>
>>     >>> --sebastian
>>     >>>
>>     >>>
>>     >>> [1] http://infolab.stanford.edu/~ullman/mmds.html
>>     >>>
>>     >>> On 17.05.2012 16:23, Paolo Castagna wrote:
>>     >>>> Hi,
>>     >>>> I noticed that the SimplePageRankVertex implementation does not
>>     deal with
>>     >>>> dangling nodes (dangling nodes are those nodes with no outgoing
>>     links).
>>     >>>> And, probably that is one of the good reasons why there is a
>>     "Simple" in front. ;-)
>>     >>>>
>>     >>>>  "Dangling nodes exist in many forms. For example, a page of
>>     data, a page with a
>>     >>>> postscript graph, [...], a page that has been fetched by a
>>     crawler but not yet
>>     >>>> explored - these are all examples of possible dangling nodes.
>>     The more ambitious
>>     >>>> the crawl, the bigger the proportion of dangling nodes because
>>     the set of
>>     >>>> fetched but uncrawled pages grows quickly." (pag. 80) [1]
>>     >>>>
>>     >>>> Wikipedia's PageRank page says:
>>     >>>>
>>     >>>>  "If a page has no links to other pages, it becomes a sink and
>>     therefore
>>     >>>> terminates the random surfing process. If the random surfer
>>     arrives at a sink
>>     >>>> page, it picks another URL at random and continues surfing
>>     again. When
>>     >>>> calculating PageRank, pages with no outbound links are assumed
>>     to link out to
>>     >>>> all other pages in the collection. Their PageRank scores are
>>     therefore divided
>>     >>>> evenly among all other pages. In other words, to be fair with
>>     pages that are not
>>     >>>> sinks, these random transitions are added to all nodes in the
>>     Web, with a
>>     >>>> residual probability usually set to d = 0.85, estimated from
>>     the frequency that
>>     >>>> an average surfer uses his or her browser's bookmark feature." [2]
>>     >>>>
>>     >>>> Now, if I want to try to account for the dangling nodes I need
>>     to sum all
>>     >>>> PageRank values for the dangling nodes and divide it evenly
>>     among all other
>>     >>>> pages (which in practice means to send a message to all vertexes
>>     in Giraph...
>>     >>>> which is not possible, as it would be crazy with large graphs).
>>     >>>>
>>     >>>> I imagine one can use a SumAggregator and alternate computing
>>     contribution from
>>     >>>> dangling nodes and PageRank values (i.e. PageRank values are
>>     computed only every
>>     >>>> 2 supersteps).
>>     >>>>
>>     >>>> What do you think?
>>     >>>>
>>     >>>> Paolo
>>     >>>>
>>     >>>>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and
>>     Beyond: The
>>     >>>> Science of Search Engine Rankings" (2006)
>>     >>>>  [2] http://en.wikipedia.org/wiki/PageRank
>>     >
>>
>>


Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi Gianmarco

Gianmarco De Francisci Morales wrote:
> Could you just accumulate the PR of dangling nodes in a variable and
> divide it equally in the next iteration?

Yep, this is exactly what I was thinking when I wrote: "I imagine one can use a SumAggregator and alternate computing contribution from dangling nodes and PageRank values (i.e. PageRank values are
computed only every 2 supersteps)."

You need to use an Aggregator and spend a superstep just to aggregate/sum the PageRank of all dangling nodes.
Then, in the next superstep, you can divide that sum by the number of nodes in the graph and distribute it equally among all other nodes (i.e. the sumOfDanglingNodesRanks / getNumVertices() below); see the sketch below.
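
In rough Java (just a sketch, untested: it reuses the method names from the SimplePageRankVertex snippet quoted below, and it assumes a registered sum aggregator named "dangling" with aggregate()/getAggregatedValue() accessors, which may not match the exact Giraph aggregator API), the compute() could look like this:

    @Override
    public void compute(Iterator<DoubleWritable> msgIterator) {
      if (getSuperstep() >= 1) {
        double sum = 0;
        while (msgIterator.hasNext()) {
          sum += msgIterator.next().get();
        }
        // Dangling mass collected by the aggregator in the previous superstep
        // (aggregator accessors by name are an assumption, see above).
        DoubleWritable dangling = getAggregatedValue("dangling");
        setVertexValue(new DoubleWritable(
            0.15f / getNumVertices()
                + 0.85f * (sum + dangling.get() / getNumVertices())));
      }
      if (getSuperstep() < MAX_SUPERSTEPS) {
        long edges = getNumOutEdges();
        if (edges > 0) {
          sendMsgToAllEdges(
              new DoubleWritable(getVertexValue().get() / edges));
        } else {
          // Dangling node: no outgoing links, so publish the rank to the
          // aggregator instead; it gets redistributed in the next superstep.
          aggregate("dangling", getVertexValue());
        }
      } else {
        voteToHalt();
      }
    }

Note this folds the dangling contribution of superstep S into the ranks computed at superstep S+1 (a one-superstep lag), rather than spending a dedicated superstep on it.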

Another change/enhancement could be to check for convergence (to do that, you need to keep current and previous PageRank values) and spend another superstep just doing that.
But, as you can see, things stop being simple, and the point of SimplePageRankVertex is to offer a very simple example.
I did not start this thread with the aim of changing SimplePageRankVertex; I think it should stay as it is (and as Google's Pregel paper describes it).
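
For the convergence check, a possible fragment (again assuming aggregator helpers by name, here a sum aggregator "delta"; this is only the idea, not existing Giraph API):

    // Inside compute(), once the new rank has been computed:
    double newRank = 0.15f / getNumVertices() + 0.85f * sum;
    // Aggregate the absolute change, so the total difference between two
    // consecutive iterations is available to every vertex in the next superstep.
    aggregate("delta",
        new DoubleWritable(Math.abs(newRank - getVertexValue().get())));
    setVertexValue(new DoubleWritable(newRank));
    // Vertices can then voteToHalt() once the aggregated "delta" from the
    // previous superstep drops below some epsilon.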

However, in practice, if you want to run PageRank or something similar, you need to deal with problems such as: what to do with dangling nodes/pages? How many iterations do you need to reach the
precision you want? Etc.

My initial message was to exchange some ideas and get some feedback and/or suggestions on how a not-too-simple PageRankVertex might look. ;-)

> If you use a preference vector instead of dividing it equally, then you
> have also the strongly preferential (and weakly preferential in case you
> divide equally) versions of RWR.

Yep, but how do you get your preference vector? ;-)

As a first step, it would be great if we could come up with a PageRankVertex which takes dangling nodes into account and, optionally, checks for convergence every few supersteps.

Anyway, thanks for your comments and feedback Gianmarco.

Paolo

> 
> Cheers,
> --
> Gianmarco
> 
> 
> 
> 
> On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna
> <castagna.lists@googlemail.com>
> wrote:
> 
>     Sebastian Schelter wrote:
>     > You are right, the code would throw an exception here, but I guess it
>     > would be sufficient to simply add a check here and not send any
>     messages.
> 
>     That would avoid throwing an exception, but I am not sure it is
>     actually the best thing to do.
>     Some suggest dividing the PageRank scores of dangling nodes evenly
>     among all other pages.
>     If you want, this is equivalent to another random jump, but forced
>     by the fact that there are no outgoing links to select from.
>     Using your notation below:
> 
>      0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks +
>     sumOfDanglingNodesRanks / getNumVertices() )
> 
>     See also pag. 38 of the book "Google's PageRank and Beyond":
>     http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38
> 
>     In other words, in pseudo-code, a serial implementation which
>     accounts for dangling nodes should be:
> 
>     --------
>     dangling_nodes_ranks = 0
>     foreach node in dangling_nodes
>        dangling_nodes_ranks += ranks.get ( node )
>     dangling_nodes_ranks = ( damping_factor * dangling_nodes_ranks ) / N
>
>     new_ranks = empty map
>     foreach node1 in graph
>        r = 0
>        foreach node2 in incoming_links ( node1 )
>            r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
>        new_ranks.put ( node1, damping_factor * r +
>            dangling_nodes_ranks + ( 1 - damping_factor ) / N )
>     --------
> 
>     What do you think?
> 
>     Paolo
> 
>     >
>     > --sebastian
>     >
>     >
>     > On 17.05.2012 23:44, Paolo Castagna wrote:
>     >> Hi Sebastian,
>     >> from SimplePageRankVertex.java, we have:
>     >>
>     >>     if (getSuperstep() < MAX_SUPERSTEPS) {
>     >>       long edges = getNumOutEdges();
>     >>       sendMsgToAllEdges(
>     >>           new DoubleWritable(getVertexValue().get() / edges));
>     >>     } else {
>     >>       voteToHalt();
>     >>     }
>     >>
>     >> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling
>     node)?
>     >>
>     >> Paolo
>     >>
>     >> Sebastian Schelter wrote:
>     >>> The implementation implicitly deals with dangling nodes by
>     computing the
>     >>> pageRank as
>     >>>
>     >>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>     >>>
>     >>> Conceptually, this is the same as connecting every vertex with every
>     >>> other vertex it is not yet connected to. This removes all
>     dangling nodes
>     >>> from the graph as every vertex can reach every other vertex.
>     >>>
>     >>> "Mining of Massive Datasets" [1] has a very well written
>     explanation of
>     >>> this approach.
>     >>>
>     >>> --sebastian
>     >>>
>     >>>
>     >>> [1] http://infolab.stanford.edu/~ullman/mmds.html
>     >>>
>     >>> On 17.05.2012 16:23, Paolo Castagna wrote:
>     >>>> Hi,
>     >>>> I noticed that the SimplePageRankVertex implementation does not
>     deal with
>     >>>> dangling nodes (dangling nodes are those nodes with no outgoing
>     links).
>     >>>> And, probably that is one of the good reasons why there is a
>     "Simple" in front. ;-)
>     >>>>
>     >>>>  "Dangling nodes exist in many forms. For example, a page of
>     data, a page with a
>     >>>> postscript graph, [...], a page that has been fetched by a
>     crawler but not yet
>     >>>> explored - these are all examples of possible dangling nodes.
>     The more ambitious
>     >>>> the crawl, the bigger the proportion of dangling nodes because
>     the set of
>     >>>> fetched but uncrawled pages grows quickly." (pag. 80) [1]
>     >>>>
>     >>>> Wikipedia's PageRank page says:
>     >>>>
>     >>>>  "If a page has no links to other pages, it becomes a sink and
>     therefore
>     >>>> terminates the random surfing process. If the random surfer
>     arrives at a sink
>     >>>> page, it picks another URL at random and continues surfing
>     again. When
>     >>>> calculating PageRank, pages with no outbound links are assumed
>     to link out to
>     >>>> all other pages in the collection. Their PageRank scores are
>     therefore divided
>     >>>> evenly among all other pages. In other words, to be fair with
>     pages that are not
>     >>>> sinks, these random transitions are added to all nodes in the
>     Web, with a
>     >>>> residual probability usually set to d = 0.85, estimated from
>     the frequency that
>     >>>> an average surfer uses his or her browser's bookmark feature." [2]
>     >>>>
>     >>>> Now, if I want to try to account for the dangling nodes I need
>     to sum all
>     >>>> PageRank values for the dangling nodes and divide it evenly
>     among all other
>     >>>> pages (which in practice means to send a message to all vertexes
>     in Giraph...
>     >>>> which is not possible, as it would be crazy with large graphs).
>     >>>>
>     >>>> I imagine one can use a SumAggregator and alternate computing
>     contribution from
>     >>>> dangling nodes and PageRank values (i.e. PageRank values are
>     computed only every
>     >>>> 2 supersteps).
>     >>>>
>     >>>> What do you think?
>     >>>>
>     >>>> Paolo
>     >>>>
>     >>>>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and
>     Beyond: The
>     >>>> Science of Search Engine Rankings" (2006)
>     >>>>  [2] http://en.wikipedia.org/wiki/PageRank
>     >
> 
> 

Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
Could you just accumulate the PR of dangling nodes in a variable and divide
it equally in the next iteration?
If you use a preference vector instead of dividing it equally, then you
have also the strongly preferential (and weakly preferential in case you
divide equally) versions of RWR.
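
For reference, in the matrix notation of the Langville & Meyer book cited further down in this thread (a sketch: $H$ is the row-normalized link matrix, $a$ the dangling-node indicator vector, $v$ the preference vector, $\alpha$ the damping factor), both variants solve

  $\pi^T = \alpha \, \pi^T H + \alpha \, (\pi^T a) \, w^T + (1 - \alpha) \, v^T$

with $w = v$ giving the strongly preferential version and $w = e/N$ (uniform) the weakly preferential one.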

Cheers,
--
Gianmarco




On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna <
castagna.lists@googlemail.com> wrote:

> Sebastian Schelter wrote:
> > You are right, the code would throw an exception here, but I guess it
> > would be sufficient to simply add a check here and not send any messages.
>
> That would avoid throwing an exception, but I am not sure it is actually
> the best thing to do.
> Some suggest dividing the PageRank scores of dangling nodes evenly among
> all other pages.
> If you want, this is equivalent to another random jump, but forced by the
> fact that there are no outgoing links to select from.
> Using your notation below:
>
>  0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks +
> sumOfDanglingNodesRanks / getNumVertices() )
>
> See also pag. 38 of the book "Google's PageRank and Beyond":
> http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38
>
> In other words, in pseudo-code, a serial implementation which accounts for
> dangling nodes should be:
>
> --------
> dangling_nodes_ranks = 0
> foreach node in dangling_nodes
>    dangling_nodes_ranks += ranks.get ( node )
> dangling_nodes_ranks = ( damping_factor * dangling_nodes_ranks ) / N
>
> new_ranks = empty map
> foreach node1 in graph
>    r = 0
>    foreach node2 in incoming_links ( node1 )
>        r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
>    new_ranks.put ( node1, damping_factor * r + dangling_nodes_ranks + ( 1 - damping_factor ) / N )
> --------
>
> What do you think?
>
> Paolo
>
> >
> > --sebastian
> >
> >
> > On 17.05.2012 23:44, Paolo Castagna wrote:
> >> Hi Sebastian,
> >> from SimplePageRankVertex.java, we have:
> >>
> >>     if (getSuperstep() < MAX_SUPERSTEPS) {
> >>       long edges = getNumOutEdges();
> >>       sendMsgToAllEdges(
> >>           new DoubleWritable(getVertexValue().get() / edges));
> >>     } else {
> >>       voteToHalt();
> >>     }
> >>
> >> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?
> >>
> >> Paolo
> >>
> >> Sebastian Schelter wrote:
> >>> The implementation implicitly deals with dangling nodes by computing
> the
> >>> pageRank as
> >>>
> >>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
> >>>
> >>> Conceptually, this is the same as connecting every vertex with every
> >>> other vertex it is not yet connected to. This removes all dangling
> nodes
> >>> from the graph as every vertex can reach every other vertex.
> >>>
> >>> "Mining of Massive Datasets" [1] has a very well written explanation of
> >>> this approach.
> >>>
> >>> --sebastian
> >>>
> >>>
> >>> [1] http://infolab.stanford.edu/~ullman/mmds.html
> >>>
> >>> On 17.05.2012 16:23, Paolo Castagna wrote:
> >>>> Hi,
> >>>> I noticed that the SimplePageRankVertex implementation does not deal
> with
> >>>> dangling nodes (dangling nodes are those nodes with no outgoing
> links).
> >>>> And, probably that is one of the good reasons why there is a "Simple"
> in front. ;-)
> >>>>
> >>>>  "Dangling nodes exist in many forms. For example, a page of data, a
> page with a
> >>>> postscript graph, [...], a page that has been fetched by a crawler
> but not yet
> >>>> explored - these are all examples of possible dangling nodes. The
> more ambitious
> >>>> the crawl, the bigger the proportion of dangling nodes because the
> set of
> >>>> fetched but uncrawled pages grows quickly." (pag. 80) [1]
> >>>>
> >>>> Wikipedia's PageRank page says:
> >>>>
> >>>>  "If a page has no links to other pages, it becomes a sink and
> therefore
> >>>> terminates the random surfing process. If the random surfer arrives
> at a sink
> >>>> page, it picks another URL at random and continues surfing again. When
> >>>> calculating PageRank, pages with no outbound links are assumed to
> link out to
> >>>> all other pages in the collection. Their PageRank scores are
> therefore divided
> >>>> evenly among all other pages. In other words, to be fair with pages
> that are not
> >>>> sinks, these random transitions are added to all nodes in the Web,
> with a
> >>>> residual probability usually set to d = 0.85, estimated from the
> frequency that
> >>>> an average surfer uses his or her browser's bookmark feature." [2]
> >>>>
> >>>> Now, if I want to try to account for the dangling nodes I need to sum
> all
> >>>> PageRank values for the dangling nodes and divide it evenly among all
> other
> >>>> pages (which in practice means to send a message to all vertexes in
> Giraph...
> >>>> which is not possible, as it would be crazy with large graphs).
> >>>>
> >>>> I imagine one can use a SumAggregator and alternate computing
> contribution from
> >>>> dangling nodes and PageRank values (i.e. PageRank values are computed
> only every
> >>>> 2 supersteps).
> >>>>
> >>>> What do you think?
> >>>>
> >>>> Paolo
> >>>>
> >>>>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and
> Beyond: The
> >>>> Science of Search Engine Rankings" (2006)
> >>>>  [2] http://en.wikipedia.org/wiki/PageRank
> >
>

Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Paolo Castagna <ca...@googlemail.com>.
Sebastian Schelter wrote:
> You are right, the code would throw an exception here, but I guess it
> would be sufficient to simply add a check here and not send any messages.

That would avoid throwing an exception, but I am not sure it is actually the best thing to do.
Some suggest dividing the PageRank scores of dangling nodes evenly among all other pages.
If you want, this is equivalent to another random jump, but forced by the fact that there are no outgoing links to select from.
Using your notation below:

  0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks + sumOfDanglingNodesRanks / getNumVertices() )

See also pag. 38 of the book "Google's PageRank and Beyond":
http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38
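
In the usual notation (a sketch: $d$ is the damping factor, $N$ the number of vertices, $D$ the set of dangling nodes), that update reads

  $PR(v) = \frac{1 - d}{N} + d \left( \sum_{u \to v} \frac{PR(u)}{outdeg(u)} + \frac{1}{N} \sum_{w \in D} PR(w) \right)$

i.e. the dangling mass is spread uniformly, exactly like a forced random jump.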

In other words, in pseudo-code, a serial implementation which accounts for dangling nodes should be:

--------
dangling_nodes_ranks = 0
foreach node in dangling_nodes
    dangling_nodes_ranks += ranks.get ( node )
dangling_nodes_ranks = ( damping_factor * dangling_nodes_ranks ) / N

new_ranks = empty map
foreach node1 in graph
    r = 0
    foreach node2 in incoming_links ( node1 )
        r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
    new_ranks.put ( node1, damping_factor * r + dangling_nodes_ranks + ( 1 - damping_factor ) / N )
--------
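
To make it concrete, a self-contained Java translation of the pseudo-code (a sketch: the map-based graph representation and the class/method names are made up for illustration, this is not Giraph code):

    import java.util.*;

    public class SerialPageRank {
      // One PageRank iteration with uniform redistribution of dangling mass.
      static Map<String, Double> iterate(
          Map<String, List<String>> outLinks,  // node -> outgoing links
          Map<String, List<String>> inLinks,   // node -> incoming links
          Map<String, Double> ranks,
          double dampingFactor) {
        int n = ranks.size();

        // Rank mass currently sitting on dangling nodes (no outgoing links).
        double danglingRanks = 0;
        for (String node : ranks.keySet()) {
          if (outLinks.getOrDefault(node, Collections.<String>emptyList()).isEmpty()) {
            danglingRanks += ranks.get(node);
          }
        }
        danglingRanks = dampingFactor * danglingRanks / n;

        Map<String, Double> newRanks = new HashMap<>();
        for (String node1 : ranks.keySet()) {
          double r = 0;
          for (String node2 : inLinks.getOrDefault(node1, Collections.<String>emptyList())) {
            r += ranks.get(node2) / outLinks.get(node2).size();
          }
          newRanks.put(node1,
              dampingFactor * r + danglingRanks + (1 - dampingFactor) / n);
        }
        return newRanks;
      }
    }

With this, the ranks keep summing to 1 across iterations, which is an easy invariant to test.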

What do you think?

Paolo

> 
> --sebastian
> 
> 
> On 17.05.2012 23:44, Paolo Castagna wrote:
>> Hi Sebastian,
>> from SimplePageRankVertex.java, we have:
>>
>>     if (getSuperstep() < MAX_SUPERSTEPS) {
>>       long edges = getNumOutEdges();
>>       sendMsgToAllEdges(
>>           new DoubleWritable(getVertexValue().get() / edges));
>>     } else {
>>       voteToHalt();
>>     }
>>
>> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?
>>
>> Paolo
>>
>> Sebastian Schelter wrote:
>>> The implementation implicitly deals with dangling nodes by computing the
>>> pageRank as
>>>
>>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>>>
>>> Conceptually, this is the same as connecting every vertex with every
>>> other vertex it is not yet connected to. This removes all dangling nodes
>>> from the graph as every vertex can reach every other vertex.
>>>
>>> "Mining of Massive Datasets" [1] has a very well written explanation of
>>> this approach.
>>>
>>> --sebastian
>>>
>>>
>>> [1] http://infolab.stanford.edu/~ullman/mmds.html
>>>
>>> On 17.05.2012 16:23, Paolo Castagna wrote:
>>>> Hi,
>>>> I noticed that the SimplePageRankVertex implementation does not deal with
>>>> dangling nodes (dangling nodes are those nodes with no outgoing links).
>>>> And, probably that is one of the good reasons why there is a "Simple" in front. ;-)
>>>>
>>>>  "Dangling nodes exist in many forms. For example, a page of data, a page with a
>>>> postscript graph, [...], a page that has been fetched by a crawler but not yet
>>>> explored - these are all examples of possible dangling nodes. The more ambitious
>>>> the crawl, the bigger the proportion of dangling nodes because the set of
>>>> fetched but uncrawled pages grows quickly." (pag. 80) [1]
>>>>
>>>> Wikipedia's PageRank page says:
>>>>
>>>>  "If a page has no links to other pages, it becomes a sink and therefore
>>>> terminates the random surfing process. If the random surfer arrives at a sink
>>>> page, it picks another URL at random and continues surfing again. When
>>>> calculating PageRank, pages with no outbound links are assumed to link out to
>>>> all other pages in the collection. Their PageRank scores are therefore divided
>>>> evenly among all other pages. In other words, to be fair with pages that are not
>>>> sinks, these random transitions are added to all nodes in the Web, with a
>>>> residual probability usually set to d = 0.85, estimated from the frequency that
>>>> an average surfer uses his or her browser's bookmark feature." [2]
>>>>
>>>> Now, if I want to try to account for the dangling nodes I need to sum all
>>>> PageRank values for the dangling nodes and divide it evenly among all other
>>>> pages (which in practice means to send a message to all vertexes in Giraph...
>>>> which is not possible, as it would be crazy with large graphs).
>>>>
>>>> I imagine one can use a SumAggregator and alternate computing contribution from
>>>> dangling nodes and PageRank values (i.e. PageRank values are computed only every
>>>> 2 supersteps).
>>>>
>>>> What do you think?
>>>>
>>>> Paolo
>>>>
>>>>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The
>>>> Science of Search Engine Rankings" (2006)
>>>>  [2] http://en.wikipedia.org/wiki/PageRank
> 

Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Sebastian Schelter <ss...@apache.org>.
You are right, the code would throw an exception here, but I guess it
would be sufficient to simply add a check here and not send any messages.

--sebastian


On 17.05.2012 23:44, Paolo Castagna wrote:
> Hi Sebastian,
> from SimplePageRankVertex.java, we have:
> 
>     if (getSuperstep() < MAX_SUPERSTEPS) {
>       long edges = getNumOutEdges();
>       sendMsgToAllEdges(
>           new DoubleWritable(getVertexValue().get() / edges));
>     } else {
>       voteToHalt();
>     }
> 
> What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?
> 
> Paolo
> 
> Sebastian Schelter wrote:
>> The implementation implicitly deals with dangling nodes by computing the
>> pageRank as
>>
>> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
>>
>> Conceptually, this is the same as connecting every vertex with every
>> other vertex it is not yet connected to. This removes all dangling nodes
>> from the graph as every vertex can reach every other vertex.
>>
>> "Mining of Massive Datasets" [1] has a very well written explanation of
>> this approach.
>>
>> --sebastian
>>
>>
>> [1] http://infolab.stanford.edu/~ullman/mmds.html
>>
>> On 17.05.2012 16:23, Paolo Castagna wrote:
>>> Hi,
>>> I noticed that the SimplePageRankVertex implementation does not deal with
>>> dangling nodes (dangling nodes are those nodes with no outgoing links).
>>> And, probably that is one of the good reasons why there is a "Simple" in front. ;-)
>>>
>>>  "Dangling nodes exist in many forms. For example, a page of data, a page with a
>>> postscript graph, [...], a page that has been fetched by a crawler but not yet
>>> explored - these are all examples of possible dangling nodes. The more ambitious
>>> the crawl, the bigger the proportion of dangling nodes because the set of
>>> fetched but uncrawled pages grows quickly." (pag. 80) [1]
>>>
>>> Wikipedia's PageRank page says:
>>>
>>>  "If a page has no links to other pages, it becomes a sink and therefore
>>> terminates the random surfing process. If the random surfer arrives at a sink
>>> page, it picks another URL at random and continues surfing again. When
>>> calculating PageRank, pages with no outbound links are assumed to link out to
>>> all other pages in the collection. Their PageRank scores are therefore divided
>>> evenly among all other pages. In other words, to be fair with pages that are not
>>> sinks, these random transitions are added to all nodes in the Web, with a
>>> residual probability usually set to d = 0.85, estimated from the frequency that
>>> an average surfer uses his or her browser's bookmark feature." [2]
>>>
>>> Now, if I want to try to account for the dangling nodes I need to sum all
>>> PageRank values for the dangling nodes and divide it evenly among all other
>>> pages (which in practice means to send a message to all vertexes in Giraph...
>>> which is not possible, as it would be crazy with large graphs).
>>>
>>> I imagine one can use a SumAggregator and alternate computing contribution from
>>> dangling nodes and PageRank values (i.e. PageRank values are computed only every
>>> 2 supersteps).
>>>
>>> What do you think?
>>>
>>> Paolo
>>>
>>>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The
>>> Science of Search Engine Rankings" (2006)
>>>  [2] http://en.wikipedia.org/wiki/PageRank
>>


Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi Sebastian,
from SimplePageRankVertex.java, we have:

    if (getSuperstep() < MAX_SUPERSTEPS) {
      long edges = getNumOutEdges();
      sendMsgToAllEdges(
          new DoubleWritable(getVertexValue().get() / edges));
    } else {
      voteToHalt();
    }

What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?

Paolo

Sebastian Schelter wrote:
> The implementation implicitly deals with dangling nodes by computing the
> pageRank as
> 
> 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks
> 
> Conceptually, this is the same as connecting every vertex with every
> other vertex it is not yet connected to. This removes all dangling nodes
> from the graph as every vertex can reach every other vertex.
> 
> "Mining of Massive Datasets" [1] has a very well written explanation of
> this approach.
> 
> --sebastian
> 
> 
> [1] http://infolab.stanford.edu/~ullman/mmds.html
> 
> On 17.05.2012 16:23, Paolo Castagna wrote:
>> Hi,
>> I noticed that the SimplePageRankVertex implementation does not deal with
>> dangling nodes (dangling nodes are those nodes with no outgoing links).
>> And, probably that is one of the good reasons why there is a "Simple" in front. ;-)
>>
>>  "Dangling nodes exist in many forms. For example, a page of data, a page with a
>> postscript graph, [...], a page that has been fetched by a crawler but not yet
>> explored - these are all examples of possible dangling nodes. The more ambitious
>> the crawl, the bigger the proportion of dangling nodes because the set of
>> fetched but uncrawled pages grows quickly." (pag. 80) [1]
>>
>> Wikipedia's PageRank page says:
>>
>>  "If a page has no links to other pages, it becomes a sink and therefore
>> terminates the random surfing process. If the random surfer arrives at a sink
>> page, it picks another URL at random and continues surfing again. When
>> calculating PageRank, pages with no outbound links are assumed to link out to
>> all other pages in the collection. Their PageRank scores are therefore divided
>> evenly among all other pages. In other words, to be fair with pages that are not
>> sinks, these random transitions are added to all nodes in the Web, with a
>> residual probability usually set to d = 0.85, estimated from the frequency that
>> an average surfer uses his or her browser's bookmark feature." [2]
>>
>> Now, if I want to try to account for the dangling nodes I need to sum all
>> PageRank values for the dangling nodes and divide it evenly among all other
>> pages (which in practice means to send a message to all vertexes in Giraph...
>> which is not possible, as it would be crazy with large graphs).
>>
>> I imagine one can use a SumAggregator and alternate computing contribution from
>> dangling nodes and PageRank values (i.e. PageRank values are computed only every
>> 2 supersteps).
>>
>> What do you think?
>>
>> Paolo
>>
>>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The
>> Science of Search Engine Rankings" (2006)
>>  [2] http://en.wikipedia.org/wiki/PageRank
> 

Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

Posted by Sebastian Schelter <ss...@apache.org>.
The implementation implicitly deals with dangling nodes by computing the
pageRank as

0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks

Conceptually, this is the same as connecting every vertex with every
other vertex it is not yet connected to. This removes all dangling nodes
from the graph as every vertex can reach every other vertex.

"Mining of Massive Datasets" [1] has a very well written explanation of
this approach.

--sebastian


[1] http://infolab.stanford.edu/~ullman/mmds.html

On 17.05.2012 16:23, Paolo Castagna wrote:
> Hi,
> I noticed that the SimplePageRankVertex implementation does not deal with
> dangling nodes (dangling nodes are those nodes with no outgoing links).
> And, probably that is one of the good reasons why there is a "Simple" in front. ;-)
> 
>  "Dangling nodes exist in many forms. For example, a page of data, a page with a
> postscript graph, [...], a page that has been fetched by a crawler but not yet
> explored - these are all examples of possible dangling nodes. The more ambitious
> the crawl, the bigger the proportion of dangling nodes because the set of
> fetched but uncrawled pages grows quickly." (pag. 80) [1]
> 
> Wikipedia's PageRank page says:
> 
>  "If a page has no links to other pages, it becomes a sink and therefore
> terminates the random surfing process. If the random surfer arrives at a sink
> page, it picks another URL at random and continues surfing again. When
> calculating PageRank, pages with no outbound links are assumed to link out to
> all other pages in the collection. Their PageRank scores are therefore divided
> evenly among all other pages. In other words, to be fair with pages that are not
> sinks, these random transitions are added to all nodes in the Web, with a
> residual probability usually set to d = 0.85, estimated from the frequency that
> an average surfer uses his or her browser's bookmark feature." [2]
> 
> Now, if I want to try to account for the dangling nodes I need to sum all
> PageRank values for the dangling nodes and divide it evenly among all other
> pages (which in practice means to send a message to all vertexes in Giraph...
> which is not possible, as it would be crazy with large graphs).
> 
> I imagine one can use a SumAggregator and alternate computing contribution from
> dangling nodes and PageRank values (i.e. PageRank values are computed only every
> 2 supersteps).
> 
> What do you think?
> 
> Paolo
> 
>  [1] Amy N. Langville and Carl D. Meyer, "Google's PageRank and Beyond: The
> Science of Search Engine Rankings" (2006)
>  [2] http://en.wikipedia.org/wiki/PageRank