Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/10/23 18:28:51 UTC

LinkRank to converge automatically

Hi,

We would like to see if we can make LinkRank terminate itself once a certain 
convergence is reached, more or less. To do so I assume we would have to store 
the previously calculated score in LinkDatum, calculate a delta, compute some 
ratio of it and compare that with a predefined threshold.

I also assume we would have to do this on a global level (entire graph) because 
many links will never change score.
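Something like this minimal sketch is roughly what I have in mind (hypothetical 
class and field names, just to illustrate the delta/ratio/threshold idea, not 
the actual LinkDatum API):

// Sketch only: assumes we also carry the score from the previous
// LinkRank iteration next to the current one.
public class ConvergenceSketch {

  static final float THRESHOLD = 0.001f;   // predefined convergence threshold

  // Relative change of a single node between two iterations.
  static float relativeDelta(float previousScore, float currentScore) {
    float delta = Math.abs(currentScore - previousScore);
    return previousScore == 0.0f ? delta : delta / previousScore;
  }

  // Global check over the entire graph: stop when the average change is small.
  static boolean converged(double summedDelta, long nodeCount) {
    return (summedDelta / nodeCount) < THRESHOLD;
  }
}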

Any insights to share?

Thanks

Re: LinkRank to converge automatically

Posted by Markus Jelsma <ma...@openindex.io>.

On Friday 28 October 2011 00:24:36 Andrzej Bialecki wrote:
> On 27/10/2011 23:36, Markus Jelsma wrote:
> [..]
> 
> >> FWIR one optimization is to look at the change over two passes of the
> >> mass for a given page, as it typically ping-pongs between high/low
> >> values as it converges. From that, you can guess at what the final
> >> value would be, and adjust the mass towards that value.
> > 
> > Very interesting, I've not noticed and wouldn't have expected such
> > ping-ponging between higher and lower values before convergence.
> > 
> > From what I read in some literature I would have to subtract these two
> > LinkRank vectors (current and previous), calculate the norm and compare
> > it with a defined threshold. The issue now is how. Processing vectors of
> > this size would require an additional mapred job and cannot be calculated
> > on-the-fly. Unless there's some fine trick I'm not yet aware of.
> 
> Aggregations like this can be collected in Hadoop Counter-s, and written
> to a side-effect file at the end of a job. E.g. we can create a counter
> to sum up all mass from dangling nodes, and to sum up the total number
> of nodes (and the number of dangling nodes). Then we redistribute this
> mass to all nodes as an initialization parameter for the next iteration.
> Similarly, if we keep track of the previous iteration's score then we
> can sum up deltas from the next iteration to give us an L1 norm.

Of course!!! Thanks for sharing this trick. 

> 
> Hadoop Counter-s are of Long type, so they don't work directly with
> float values. I used the following simple trick in other mapred jobs:
> multiply float values by 10,000 and round to a Long, and after the job
> is done divide the result by 10,000 and cast it to a float. Having a
> precision down to four decimal places should be enough for everybody ;)


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: LinkRank to converge automatically

Posted by Marek Bachmann <m....@uni-kassel.de>.
Thank you very much :)

On 28.10.2011 15:24, Markus Jelsma wrote:
>
>
> Web Data Mining has a part on PageRank and HITS:
> http://www.amazon.com/gp/product/3642194591/
>
> But I think Ken's suggestion would be more appropriate for such a thesis:
> http://www.amazon.com/Googles-PageRank-Beyond-Science-Rankings/dp/0691122024
>
>
> On Friday 28 October 2011 15:14:48 Marek Bachmann wrote:
>> Hi Markus,
>>
>>>    From what i read in some literature i would have to subtract these two
>>>
>>> LinkRank vectors (current and previous), calculate the norm and compare
>>> with a defined threshold. The issue now is how. Processing vectors of
>>> this size would require an additional mapred and cannot be calculated
>>> on-the-fly. Unless i'm not yet aware of some fine trick.
>>>
>>> Thanks again!
>>
>> can you give me suggestions for literature about LinkRank / the
>> webgraph? Since I am planning to write a thesis about Nutch and Solr this
>> would be very helpful for me.
>>
>> Thank you.
>


Re: LinkRank to converge automatically

Posted by Markus Jelsma <ma...@openindex.io>.

Web Data Mining has a part on PageRank and HITS:
http://www.amazon.com/gp/product/3642194591/

But I think Ken's suggestion would be more appropriate for such a thesis:
http://www.amazon.com/Googles-PageRank-Beyond-Science-Rankings/dp/0691122024


On Friday 28 October 2011 15:14:48 Marek Bachmann wrote:
> Hi Markus,
> 
> > From what I read in some literature I would have to subtract these two
> > LinkRank vectors (current and previous), calculate the norm and compare
> > it with a defined threshold. The issue now is how. Processing vectors of
> > this size would require an additional mapred job and cannot be calculated
> > on-the-fly. Unless there's some fine trick I'm not yet aware of.
> > 
> > Thanks again!
> 
> can you give me suggestions for literature about LinkRank / the
> webgraph? Since I am planning to write a thesis about Nutch and Solr this
> would be very helpful for me.
> 
> Thank you.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: LinkRank to converge automatically

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 28/10/2011 17:57, Ken Krugler wrote:
>
> On Oct 27, 2011, at 11:24pm, Andrzej Bialecki wrote:
>
>> On 27/10/2011 23:36, Markus Jelsma wrote:
>> [..]
>>>> FWIR one optimization is to look at the change over two passes of the mass
>>>> for a given page, as it typically ping-pongs between high/low values as
>>>> it converges. From that, you can guess at what the final value would be,
>>>> and adjust the mass towards that value.
>>>
>>> Very interesting, I've not noticed and wouldn't have expected such
>>> ping-ponging between higher and lower values before convergence.
>>>
>>> From what I read in some literature I would have to subtract these two
>>> LinkRank vectors (current and previous), calculate the norm and compare it
>>> with a defined threshold. The issue now is how. Processing vectors of this
>>> size would require an additional mapred job and cannot be calculated
>>> on-the-fly. Unless there's some fine trick I'm not yet aware of.
>>
>> Aggregations like this can be collected in Hadoop Counter-s, and written to a side-effect file at the end of a job. E.g. we can create a counter to sum up all mass from dangling nodes, and to sum up the total number of nodes (and the number of dangling nodes). Then we redistribute this mass to all nodes as an initialization parameter for the next iteration. Similarly, if we keep track of the previous iteration's score then we can sum up deltas from the next iteration to give us an L1 norm.
>>
>> Hadoop Counter-s are of Long type, so they don't work directly with float values. I used the following simple trick in other mapred jobs: multiply float values by 10,000 and round to a Long, and after the job is done divide the result by 10,000 and cast it to a float. Having a precision down to four decimal places should be enough for everybody ;)
>
> I think for PageRank this is likely to not have sufficient precision.
>
> FWIR, in Jimmy Lin's implementation he had to write up some special routines to avoid running into problems with dropping bits during math operations.
>
> Since he starts out with a total mass of 1.0, when this gets spread across 1 billion pages you get some very, very small values.

Ah, yes, in this case it would be insufficient - I used that trick for 
much larger initial values. If you start with total 1.0 then a scaling 
factor of O(vertex_count) would be more like it - in this case e.g. 1e9.
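Concretely (rough numbers): with a total mass of 1.0 over ~1e9 pages a typical 
score is on the order of 1e-9, so a 10,000x scale rounds it straight to zero, 
while a 1e9 scale puts it back around 1 before the rounding to Long.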


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: LinkRank to converge automatically

Posted by Ken Krugler <kk...@transpac.com>.
On Oct 27, 2011, at 11:24pm, Andrzej Bialecki wrote:

> On 27/10/2011 23:36, Markus Jelsma wrote:
> [..]
>>> FWIR one optimization is to look at the change over two passes of the mass
>>> for a given page, as it typically ping-pongs between high/low values as
>>> it converges. From that, you can guess at what the final value would be,
>>> and adjust the mass towards that value.
>> 
>> Very interesting, I've not noticed and wouldn't have expected such
>> ping-ponging between higher and lower values before convergence.
>> 
>> From what I read in some literature I would have to subtract these two
>> LinkRank vectors (current and previous), calculate the norm and compare it
>> with a defined threshold. The issue now is how. Processing vectors of this
>> size would require an additional mapred job and cannot be calculated
>> on-the-fly. Unless there's some fine trick I'm not yet aware of.
> 
> Aggregations like this can be collected in Hadoop Counter-s, and written to a side-effect file at the end of a job. E.g. we can create a counter to sum up all mass from dangling nodes, and to sum up the total number of nodes (and the number of dangling nodes). Then we redistribute this mass to all nodes as an initialization parameter for the next iteration. Similarly, if we keep track of the previous iteration's score then we can sum up deltas from the next iteration to give us an L1 norm.
> 
> Hadoop Counter-s are of Long type, so they don't work directly with float values. I used the following simple trick in other mapred jobs: multiply float values by 10,000 and round to a Long, and after the job is done divide the result by 10,000 and cast it to a float. Having a precision down to four decimal places should be enough for everybody ;)

I think for PageRank this is likely to not have sufficient precision.

FWIR, in Jimmy Lin's implementation he had to write up some special routines to avoid running into problems with dropping bits during math operations.

Since he starts out with a total mass of 1.0, when this gets spread across 1 billion pages you get some very, very small values.

-- Ken

--------------------------
Ken Krugler
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr




Re: LinkRank to converge automatically

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 27/10/2011 23:36, Markus Jelsma wrote:
[..]
>> FWIR one optimization is to look at the change over two passes of the mass
>> for a given page, as it typically ping-pongs between high/low values as
>> it converges. From that, you can guess at what the final value would be,
>> and adjust the mass towards that value.
>
> Very interesting, I've not noticed and wouldn't have expected such
> ping-ponging between higher and lower values before convergence.
>
> From what I read in some literature I would have to subtract these two
> LinkRank vectors (current and previous), calculate the norm and compare it
> with a defined threshold. The issue now is how. Processing vectors of this
> size would require an additional mapred job and cannot be calculated
> on-the-fly. Unless there's some fine trick I'm not yet aware of.

Aggregations like this can be collected in Hadoop Counter-s, and written 
to a side-effect file at the end of a job. E.g. we can create a counter 
to sum up all mass from dangling nodes, and to sum up the total number 
of nodes (and the number of dangling nodes). Then we redistribute this 
mass to all nodes as an initialization parameter for the next iteration. 
Similarly, if we keep track of the previous iteration's score then we 
can sum up deltas from the next iteration to give us an L1 norm.

Hadoop Counter-s are of Long type, so they don't work directly with 
float values. I used the following simple trick in other mapred jobs: 
multiply float values by 10,000 and round to a Long, and after the job 
is done divide the result by 10,000 and cast it to a float. Having a 
precision down to four decimal places should be enough for everybody ;)
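A rough sketch of how the two pieces fit together (made-up names, not the 
actual LinkRank code): accumulate the scaled per-page deltas in a counter 
inside map() or reduce(), then read the counter back in the driver and 
un-scale it.

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;

public class ScaledCounterSketch {

  public static final String GROUP = "LinkRank";
  public static final String L1_NORM = "L1_NORM_X10000";
  public static final long SCALE = 10000L;          // four decimal places

  // Inside map() or reduce(): add one page's |current - previous| to the counter.
  public static void addDelta(Counter counter, float previous, float current) {
    counter.increment(Math.round((double) Math.abs(current - previous) * SCALE));
  }

  // In the driver, after job.waitForCompletion(): recover the float L1 norm.
  public static float readL1Norm(Job job) throws Exception {
    long raw = job.getCounters().findCounter(GROUP, L1_NORM).getValue();
    return raw / (float) SCALE;
  }
}

Inside the job one would call addDelta(context.getCounter(GROUP, L1_NORM), 
previousScore, currentScore) for each page; the driver then compares 
readL1Norm(job) against the convergence threshold after each iteration.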

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: LinkRank to converge automatically

Posted by Marek Bachmann <m....@uni-kassel.de>.
Hi Markus,

> From what I read in some literature I would have to subtract these two
> LinkRank vectors (current and previous), calculate the norm and compare it
> with a defined threshold. The issue now is how. Processing vectors of this
> size would require an additional mapred job and cannot be calculated
> on-the-fly. Unless there's some fine trick I'm not yet aware of.
>
> Thanks again!
>

can you give me suggestions for literature about LinkRank / the 
webgraph? Since I am planning to write a thesis about Nutch and Solr this 
would be very helpful for me.

Thank you.

Re: LinkRank to converge automatically

Posted by Markus Jelsma <ma...@openindex.io>.
> >> On Oct 23, 2011, at 6:28pm, Markus Jelsma wrote:
> >>> Hi,
> >>> 
> >>> We would like to see if we can make LinkRank terminate itself once a
> >>> certain convergence is reached, more or less. To do so I assume we would
> >>> have to store the previously calculated score in LinkDatum, calculate
> >>> a delta, compute some ratio of it and compare that with a predefined
> >>> threshold.
> >>> 
> >>> I also assume we would have to do this on a global level (entire graph)
> >>> because many links will never change score.
> >>> 
> >>> Any insights to share?
> >> 
> >> We modified Jimmy Lin's PageRank code (from Cloud9 project at CMU) to
> >> terminate if the missing mass change was less than a provided threshold.
> >> 
> >> At least for classic PR (don't know details of LinkRank) as the results
> >> converge, the amount of missing mass at the end of each iteration also
> >> decreases, thus this served as an easy termination check.
> > 
> > Can you expand on mass? Do you mean a sum of deltas between scores, i.e.
> > how much the total change of all URLs is between iterations?
> 
> Missing mass is mass from edge pages that don't link to anything. During a
> PageRank iteration, all of this mass winds up getting "lost".
> 
> So part of the algorithm is to sum the mass that does wind up somewhere,
> subtract that from the starting mass (1.0), and then redistribute that
> (along with the mass excluded by the alpha factor, I think this is the
> same as the dampening factor) to all of the pages.
> 
> Since edge pages typically wind up not being as important as 'hub' pages,
> the mass in edge pages (and thus the missing mass) will decrease during
> iterations.
> 
> >> If you start storing score deltas, then you can apply other
> >> optimizations to get convergence to happen faster. For a good survey of
> >> techniques, I'd recommend:
> >> 
> >> http://www.amazon.com/Googles-PageRank-Beyond-Science-Rankings/dp/0691122024
> > 
> > Very interesting. Are you able to list a few optimizations that become
> > possible when storing deltas?
> 
> FWIR one optimization is to look at the change over two passes of the mass
> for a given page, as it typically ping-pongs between high/low values as
> it converges. From that, you can guess at what the final value would be,
> and adjust the mass towards that value.

Very interesting, I've not noticed and wouldn't have expected such ping-ponging 
between higher and lower values before convergence.

From what I read in some literature I would have to subtract these two 
LinkRank vectors (current and previous), calculate the norm and compare it with 
a defined threshold. The issue now is how. Processing vectors of this size would 
require an additional mapred job and cannot be calculated on-the-fly. Unless 
there's some fine trick I'm not yet aware of.

Thanks again!

> 
> -- Ken
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr

Re: LinkRank to converge automatically

Posted by Ken Krugler <kk...@transpac.com>.
Hi Markus,

I'm on vacation right now (Prague, then London) so I don't have the book or all of the code in front of me, sorry.

See below for what I remember.

On Oct 23, 2011, at 8:53pm, Markus Jelsma wrote:

> Hello Ken,
> 
>> Hi Markus,
>> 
>> On Oct 23, 2011, at 6:28pm, Markus Jelsma wrote:
>>> Hi,
>>> 
>>> We would like to see if we can make LinkRank terminate itself once a certain
>>> convergence is reached, more or less. To do so I assume we would have to
>>> store the previously calculated score in LinkDatum, calculate a delta,
>>> compute some ratio of it and compare that with a predefined threshold.
>>> 
>>> I also assume we would have to do this on a global level (entire graph)
>>> because many links will never change score.
>>> 
>>> Any insights to share?
>> 
>> We modified Jimmy Lin's PageRank code (from Cloud9 project at CMU) to
>> terminate if the missing mass change was less than a provided threshold.
>> 
>> At least for classic PR (don't know details of LinkRank) as the results
>> converge, the amount of missing mass at the end of each iteration also
>> decreases, thus this served as an easy termination check.
> 
> Can you expand on mass? Do you mean a sum of deltas between scores, i.e. how
> much the total change of all URLs is between iterations?

Missing mass is mass from edge pages that don't link to anything. During a PageRank iteration, all of this mass winds up getting "lost".

So part of the algorithm is to sum the mass that does wind up somewhere, subtract that from the starting mass (1.0), and then redistribute that (along with the mass excluded by the alpha factor, I think this is the same as the dampening factor) to all of the pages.

Since edge pages typically wind up not being as important as 'hub' pages, the mass in edge pages (and thus the missing mass) will decrease during iterations.
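As a rough single-machine sketch of that bookkeeping (uniform redistribution, 
made-up names; not Jimmy Lin's actual code):

// Simplified sketch of one PageRank-style iteration with missing-mass
// redistribution; assumes the rank vector starts out summing to 1.0.
public class MissingMassSketch {

  static double[] iterate(double[] rank, int[][] outlinks, double alpha) {
    int n = rank.length;
    double[] next = new double[n];

    // Push each page's mass to its outlinks; dangling pages push nothing.
    for (int page = 0; page < n; page++) {
      int[] links = outlinks[page];
      if (links.length == 0) continue;          // dangling: mass goes "missing"
      double share = rank[page] / links.length;
      for (int target : links) next[target] += share;
    }

    // Missing mass = what did not land anywhere (starting total is 1.0).
    double landed = 0.0;
    for (double r : next) landed += r;
    double missing = 1.0 - landed;

    // Each page gets the random-jump share plus its portion of the
    // redistributed missing mass, damped by (1 - alpha).
    for (int page = 0; page < n; page++) {
      next[page] = alpha / n + (1.0 - alpha) * (next[page] + missing / n);
    }
    return next;
  }
}

The "missing" value here is what the counter trick from the other mail would 
sum up across the cluster instead of in a local loop.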

>> If you start storing score deltas, then you can apply other optimizations
>> to get convergence to happen faster. For a good survey of techniques, I'd
>> recommend:
>> 
>> http://www.amazon.com/Googles-PageRank-Beyond-Science-Rankings/dp/0691122024
> 
> Very interesting. Are you able to list a few optimizations that become
> possible when storing deltas?

FWIR one optimization is to look at the change over two passes of the mass for a given page, as it typically ping-pongs between high/low values as it converges. From that, you can guess at what the final value would be, and adjust the mass towards that value.
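One way to read that is as Aitken-style extrapolation from three successive 
values; a toy sketch of my own, not code from the book or from Jimmy Lin:

// Guess the limit of an oscillating (ping-ponging) score from three
// successive iterations using Aitken's delta-squared extrapolation.
public class PingPongExtrapolation {

  static double guessLimit(double twoAgo, double previous, double current) {
    double d1 = previous - twoAgo;
    double d2 = current - previous;
    double denom = d2 - d1;
    if (Math.abs(denom) < 1e-12) return current;   // already (nearly) converged
    // Aitken's delta-squared: x - (dx)^2 / (d2x)
    return current - (d2 * d2) / denom;
  }

  public static void main(String[] args) {
    // A score oscillating towards ~0.5: 0.8, 0.35, 0.575, ...
    System.out.println(guessLimit(0.8, 0.35, 0.575));  // prints ~0.5
  }
}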

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr




Re: LinkRank to converge automatically

Posted by Markus Jelsma <ma...@openindex.io>.
Hello Ken,

> Hi Markus,
> 
> On Oct 23, 2011, at 6:28pm, Markus Jelsma wrote:
> > Hi,
> > 
> > We would like to see if we can make LinkRank terminate itself once a certain
> > convergence is reached, more or less. To do so I assume we would have to
> > store the previously calculated score in LinkDatum, calculate a delta,
> > compute some ratio of it and compare that with a predefined threshold.
> > 
> > I also assume we would have to do this on a global level (entire graph)
> > because many links will never change score.
> > 
> > Any insights to share?
> 
> We modified Jimmy Lin's PageRank code (from Cloud9 project at CMU) to
> terminate if the missing mass change was less than a provided threshold.
> 
> At least for classic PR (don't know details of LinkRank) as the results
> converge, the amount of missing mass at the end of each iteration also
> decreases, thus this served as an easy termination check.

Can you expand on mass? Do you mean a sum of deltas between scores, i.e. how much 
the total change of all URLs is between iterations?

> 
> If you start storing score deltas, then you can apply other optimizations
> to get convergence to happen faster. For a good survey of techniques, I'd
> recommend:
> 
> http://www.amazon.com/Googles-PageRank-Beyond-Science-Rankings/dp/0691122024

Very interesting. Are you able to list a few optimizations that become 
possible when storing deltas?

Thanks a lot for sharing your useful insights.
Markus

> 
> -- Ken
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr

Re: LinkRank to converge automatically

Posted by Ken Krugler <kk...@transpac.com>.
Hi Markus,

On Oct 23, 2011, at 6:28pm, Markus Jelsma wrote:

> Hi,
> 
> We would like to see if we can make LinkRank terminate itself once a certain
> convergence is reached, more or less. To do so I assume we would have to store
> the previously calculated score in LinkDatum, calculate a delta, compute some
> ratio of it and compare that with a predefined threshold.
> 
> I also assume we would have to do this on a global level (entire graph) because
> many links will never change score.
> 
> Any insights to share?

We modified Jimmy Lin's PageRank code (from Cloud9 project at CMU) to terminate if the missing mass change was less than a provided threshold.

At least for classic PR (don't know details of LinkRank) as the results converge, the amount of missing mass at the end of each iteration also decreases, thus this served as an easy termination check.
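In driver terms that check is roughly the following (sketch with placeholder 
names, not the actual Cloud9 code):

// Stop iterating once the missing mass changes by less than a threshold
// between iterations, with a safety cap on the number of passes.
public class IterateUntilConverged {

  static final double THRESHOLD = 1e-4;   // the provided threshold
  static final int MAX_ITERATIONS = 30;   // safety cap

  public static void main(String[] args) throws Exception {
    double previousMissingMass = Double.MAX_VALUE;
    for (int i = 0; i < MAX_ITERATIONS; i++) {
      // runIteration() would launch one PageRank/LinkRank MapReduce pass and
      // return the missing (dangling) mass it observed, e.g. via a counter.
      double missingMass = runIteration(i);
      if (Math.abs(previousMissingMass - missingMass) < THRESHOLD) {
        System.out.println("Converged after " + (i + 1) + " iterations");
        break;
      }
      previousMissingMass = missingMass;
    }
  }

  static double runIteration(int iteration) throws Exception {
    // Placeholder for the real job submission.
    return 0.0;
  }
}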

If you start storing score deltas, then you can apply other optimizations to get convergence to happen faster. For a good survey of techniques, I'd recommend:

http://www.amazon.com/Googles-PageRank-Beyond-Science-Rankings/dp/0691122024

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr