Posted to java-user@lucene.apache.org by Scott Sayles <ss...@fgm.com> on 2004/05/18 19:04:47 UTC

Page ranking

Is there anyone out there that has page ranking implemented on top of
Lucene?

Just in case anyone may be thinking otherwise, when I say page ranking
I'm not referring to the ranking of results from searches.  I'm talking
about something similar to how Google computes which page may be more
relevant or important (often referred to as PageRank), which is affected
in part by how many other pages reference that page.

I've been through the examples listed here:

http://www.iprcom.com/papers/pagerank/index.html

which provides information from the original Google paper about page
ranking.  Running the examples is fairly easy, but the big question I
have is how can I practically update such data?  And is there any
potential integration with Lucene?  It would seem that one could store
the computed ranking values in the actual Lucene Document itself, but
the updates would be fairly laborious as a few minor changes in rankings
can produce a large ripple in other related document rankings.  This, of
course, would be the same issue if the ranking information were stored
outside of Lucene.  One could potentially store this in a separate
database and then look up the ranking information for each document
found and then perform updates as an external asynchronous task.

Anyone have any experience with maintaining page rankings?


Thanks,

Scott


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On May 18, 2004, at 3:56 PM, Claude Devarenne wrote:
> Thanks, I'll try that.  It would be nice too if I could extend Field (it 
> is a final class) and create a numerical field.  Is that not 
> desirable?

It isn't that much more effort to have something like NumberUtils 
listed here: 
http://wiki.apache.org/jakarta-lucene/SearchNumericalFields

I'm not sure of the pros/cons of making Field extensible or not, but it 
really is of marginal benefit since it ultimately needs a String, and 
a conversion of numeric to String in your own code isn't that involved.  I 
suppose we could put something like NumberUtils (maybe called 
NumberField to be like DateField) in the core to have a built-in 
solution.  We probably ought to also go another step and provide Date 
-> YYYYMMDD conversion as additional parts to DateField.
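
A minimal sketch of the numeric-to-String conversion discussed above,
assuming non-negative int values and a fixed width of 10 digits.  The
class name PaddedNumber and its methods are hypothetical; they are not
taken from Lucene or from the wiki page.  The only point is that zero
padding makes the lexicographic order of the terms agree with numeric
order, which is what a range query over String terms relies on.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Lucene.
final class PaddedNumber {
    // Integer.MAX_VALUE has 10 decimal digits, so 10 is wide enough for any non-negative int.
    private static final int WIDTH = 10;

    /** Encodes a non-negative int as a fixed-width, zero-padded decimal string. */
    static String encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", value);
    }

    /** Reverses encode(). */
    static int decode(String term) {
        return Integer.parseInt(term);
    }

    public static void main(String[] args) {
        // Unpadded, "9" sorts after "10"; padded, the order matches the numbers.
        System.out.println("9".compareTo("10") > 0);
        System.out.println(encode(9).compareTo(encode(10)) < 0);
        System.out.println(encode(42));
    }
}
```

Negative numbers would need an extra step (for example an offset or a
sign prefix), which this sketch deliberately leaves out.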

	Erik



Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Claude Devarenne <cl...@library.ucsf.edu>.

Thanks, I'll try that.  It would be nice too if I could extend Field (it 
is a final class) and create a numerical field.  Is that not desirable?

Claude

On May 18, 2004, at 12:06 PM, Ype Kingma wrote:

> On Tuesday 18 May 2004 19:38, Claude Devarenne wrote:
>> Hi,
>>
>> I have over 60,000 documents in my index which is slightly over a 1 GB
>> in size.  The documents range from the late seventies up to now.  I
>> have indexed dates as a keyword field using a string because the dates
>> are in YYYYMMDD format.  When I do range queries things are OK as long
>> as I don't exceed the built-in number of boolean clauses, so that's a
>> range of 3 years, e.g. 1979 to 1981.  The users are not only doing
>> complex queries but also want to query over long ranges, e.g.
>> [19790101 TO 19991231].
>>
>> Given these requirements, I am thinking of doing a query without the
>> date range, bring the unique ids back from the hits and then do a date
>> query in the SQL database I have that contains the same data.  Another
>> alternative is to do the query without the date range in Lucene and
>> then sort the results within the range.  I still have to learn how to
>> use the new sorting code and confessed I did not have time to look at
>> it yet.
>>
>> Is there a simpler, easier way to do this?
>
> I wouldn't know of a simpler and easier way, but there is another way
> to reduce the number of clauses involved in long date ranges.
> This can be done by indexing not only YYYYMMDD but also YYYYMM and
> YYYY, and adapting the query range mechanism to use the shorter term
> whenever possible. (YYY and YYYYMMD might also be useful.)
>
>
> Kind regards,
> Ype
>
>
>



Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Ype Kingma <yk...@xs4all.nl>.

On Tuesday 18 May 2004 19:38, Claude Devarenne wrote:
> Hi,
>
> I have over 60,000 documents in my index which is slightly over a 1 GB
> in size.  The documents range from the late seventies up to now.  I
> have indexed dates as a keyword field using a string because the dates
> are in YYYYMMDD format.  When I do range queries things are OK as long
> as I don't exceed the built-in number of boolean clauses, so that's a
> range of 3 years, e.g. 1979 to 1981.  The users are not only doing
> complex queries but also want to query over long ranges, e.g. [19790101
> TO 19991231].
>
> Given these requirements, I am thinking of doing a query without the
> date range, bring the unique ids back from the hits and then do a date
> query in the SQL database I have that contains the same data.  Another
> alternative is to do the query without the date range in Lucene and
> then sort the results within the range.  I still have to learn how to
> use the new sorting code and confessed I did not have time to look at
> it yet.
>
> Is there a simpler, easier way to do this?

I wouldn't know of a simpler and easier way, but there is another way
to reduce the number of clauses involved in long date ranges.
This can be done by indexing not only YYYYMMDD but also YYYYMM and
YYYY, and adapting the query range mechanism to use the shorter term
whenever possible. (YYY and YYYYMMD might also be useful.)
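
The idea above can be sketched as follows, assuming every document is
indexed with three terms (YYYY, YYYYMM and YYYYMMDD) and that the query
side replaces a date range by the shortest list of such terms covering
it.  The class DateRangeTerms and the method cover() are hypothetical
names, and the sketch uses plain Java only (java.time), not the Lucene
API.  The decade and ten-day prefixes are left out to keep it short.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Lucene.
final class DateRangeTerms {
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE; // yyyyMMdd

    /** Returns year, month and day terms that together cover [from, to], both inclusive. */
    static List<String> cover(String from, String to) {
        LocalDate d = LocalDate.parse(from, DAY);
        LocalDate end = LocalDate.parse(to, DAY);
        List<String> terms = new ArrayList<>();
        while (!d.isAfter(end)) {
            LocalDate yearEnd = d.withDayOfYear(d.lengthOfYear());
            LocalDate monthEnd = d.withDayOfMonth(d.lengthOfMonth());
            if (d.getDayOfYear() == 1 && !yearEnd.isAfter(end)) {
                // A whole calendar year fits: one YYYY term replaces up to 366 day terms.
                terms.add(String.format(Locale.ROOT, "%04d", d.getYear()));
                d = d.plusYears(1);
            } else if (d.getDayOfMonth() == 1 && !monthEnd.isAfter(end)) {
                // A whole calendar month fits: one YYYYMM term.
                terms.add(String.format(Locale.ROOT, "%04d%02d", d.getYear(), d.getMonthValue()));
                d = d.plusMonths(1);
            } else {
                // Partial month at either edge of the range: fall back to single days.
                terms.add(d.format(DAY));
                d = d.plusDays(1);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(cover("19790101", "19991231").size());
        System.out.println(cover("19790130", "19790302"));
    }
}
```

For the range [19790101 TO 19991231] this yields 21 year terms, one
per year from 1979 to 1999, instead of one clause per distinct day in
the index.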


Kind regards,
Ype



Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Claude Devarenne <cl...@library.ucsf.edu>.

Thanks, I'll try that first and then Ype's suggestion if necessary.  I 
have been shying away from filters so now I have no excuse ;-)

Claude

On May 18, 2004, at 1:35 PM, Andy Goodell wrote:

> In our application we had a similar problem with non-date ranges until
> we realized that it wasn't so much that we were searching for the
> values in the range as restricting the search to that range, and then
> we used an extension to the org.apache.lucene.search.Filter class, and
> our implementation got much simpler and faster.
>
> - andy g
>
> On Tue, 18 May 2004 10:38:01 -0700, Claude Devarenne
> <cl...@library.ucsf.edu> wrote:
>>
>> Hi,
>>
>> I have over 60,000 documents in my index which is slightly over a 1 GB
>> in size.  The documents range from the late seventies up to now.  I
>> have indexed dates as a keyword field using a string because the dates
>> are in YYYYMMDD format.  When I do range queries things are OK as long
>> as I don't exceed the built-in number of boolean clauses, so that's a
>> range of 3 years, e.g. 1979 to 1981.  The users are not only doing
>> complex queries but also want to query over long ranges, e.g.
>> [19790101 TO 19991231].
>>
>> Given these requirements, I am thinking of doing a query without the
>> date range, bring the unique ids back from the hits and then do a date
>> query in the SQL database I have that contains the same data.  Another
>> alternative is to do the query without the date range in Lucene and
>> then sort the results within the range.  I still have to learn how to
>> use the new sorting code and confessed I did not have time to look at
>> it yet.
>>
>> Is there a simpler, easier way to do this?
>>
>> Claude
>>
>>
>>
>
>



Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Andy Goodell <go...@gmail.com>.

In our application we had a similar problem with non-date ranges until
we realized that it wasn't so much that we were searching for the
values in the range as restricting the search to that range, and then
we used an extension to the org.apache.lucene.search.Filter class, and
our implementation got much simpler and faster.
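
A rough sketch of the "restrict rather than search" idea, assuming the
filter boils down to a set of document numbers that is computed once
per range and then intersected with the hits of the real query.  It
does not use the Lucene API: the sorted map from term to document
numbers stands in for the index's term dictionary, and the names
RangeBitsSketch, bitsForRange() and demo() are hypothetical.

```java
import java.util.BitSet;
import java.util.NavigableMap;
import java.util.TreeMap;

// Plain-Java model for illustration only; not the Lucene Filter API.
final class RangeBitsSketch {
    /** Marks every document whose term lies in [lower, upper], both inclusive. */
    static BitSet bitsForRange(NavigableMap<String, int[]> postings,
                               String lower, String upper, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        // Walk the terms inside the range once; no query clause is created per term.
        for (int[] docs : postings.subMap(lower, true, upper, true).values()) {
            for (int doc : docs) {
                bits.set(doc);
            }
        }
        return bits;
    }

    /** Small made-up index: date term -> document numbers. */
    static BitSet demo() {
        NavigableMap<String, int[]> postings = new TreeMap<>();
        postings.put("19781231", new int[] {0});
        postings.put("19790101", new int[] {1, 2});
        postings.put("19850615", new int[] {3});
        postings.put("20000101", new int[] {4});
        return bitsForRange(postings, "19790101", "19991231", 5);
    }

    public static void main(String[] args) {
        BitSet allowed = demo();
        // Pretend the text query matched documents 0, 2 and 3; keep only those in range.
        BitSet hits = new BitSet(5);
        hits.set(0);
        hits.set(2);
        hits.set(3);
        hits.and(allowed);
        System.out.println(hits);
    }
}
```

Because the range never becomes a list of boolean clauses, the clause
limit does not come into play, and the bit set can be cached for as
long as the index is unchanged.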

- andy g

On Tue, 18 May 2004 10:38:01 -0700, Claude Devarenne
<cl...@library.ucsf.edu> wrote:
> 
> Hi,
> 
> I have over 60,000 documents in my index which is slightly over a 1 GB
> in size.  The documents range from the late seventies up to now.  I
> have indexed dates as a keyword field using a string because the dates
> are in YYYYMMDD format.  When I do range queries things are OK as long
> as I don't exceed the built-in number of boolean clauses, so that's a
> range of 3 years, e.g. 1979 to 1981.  The users are not only doing
> complex queries but also want to query over long ranges, e.g. [19790101
> TO 19991231].
> 
> Given these requirements, I am thinking of doing a query without the
> date range, bring the unique ids back from the hits and then do a date
> query in the SQL database I have that contains the same data.  Another
> alternative is to do the query without the date range in Lucene and
> then sort the results within the range.  I still have to learn how to
> use the new sorting code and confessed I did not have time to look at
> it yet.
> 
> Is there a simpler, easier way to do this?
> 
> Claude
> 
> 
>


Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Claude Devarenne <cl...@library.ucsf.edu>.

Thanks, I will look at the sorting code.  Sorting results by date is
next on my list.  For now, I only have a small number of documents but the
set is to grow to over 8 million documents for the collection I am
working on.  Another collection we have is 40 million documents or so.
From what you say it seems to me that sorting will not scale when
I get to a larger number of documents.  I am considering using an SQL
back end to implement sorting: bring back the unique IDs from Lucene
and then sort in SQL.

Claude

On May 18, 2004, at 11:23 PM, Morus Walter wrote:

> Claude Devarenne writes:
>> Hi,
>>
>> I have over 60,000 documents in my index which is slightly over a 1 GB
>> in size.  The documents range from the late seventies up to now.  I
>> have indexed dates as a keyword field using a string because the dates
>> are in YYYYMMDD format.  When I do range queries things are OK as long
>> as I don't exceed the built-in number of boolean clauses, so that's a
>> range of 3 years, e.g. 1979 to 1981.  The users are not only doing
>> complex queries but also want to query over long ranges, e.g.
>> [19790101 TO 19991231].
>>
>> Given these requirements, I am thinking of doing a query without the
>> date range, bring the unique ids back from the hits and then do a date
>> query in the SQL database I have that contains the same data.  Another
>> alternative is to do the query without the date range in Lucene and
>> then sort the results within the range.  I still have to learn how to
>> use the new sorting code and confessed I did not have time to look at
>> it yet.
>>
>> Is there a simpler, easier way to do this?
>>
> I think it would be worth taking a look at the sorting code.
>
> The idea of the sorting code is to have an array of the dates for each doc
> in memory and access this array for sorting.
> Now sorting isn't the only thing one might use this array for.
> Doing a range check is another.
> So you might extend the sorting code by a range selection.
>
> There is no code for this in Lucene and you have to create your own searcher
> but it gives you a fast way to search and sort by date.
>
> I did this independently of the new sorting code (I just started a little
> too early) and it works quite well.
> The only drawback of this (and the new sorting code) is that it requires
> an array of field values that must be rebuilt each time the index changes.
> Shouldn't be a problem for 60000 documents.
>
> Morus
>
>



Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Morus Walter <mo...@tanto.de>.

Claude Devarenne writes:
> Hi,
> 
> I have over 60,000 documents in my index which is slightly over a 1 GB 
> in size.  The documents range from the late seventies up to now.  I 
> have indexed dates as a keyword field using a string because the dates 
> are in YYYYMMDD format.  When I do range queries things are OK as long 
> as I don't exceed the built-in number of boolean clauses, so that's a 
> range of 3 years, e.g. 1979 to 1981.  The users are not only doing 
> complex queries but also want to query over long ranges, e.g. [19790101 
> TO 19991231].
> 
> Given these requirements, I am thinking of doing a query without the 
> date range, bring the unique ids back from the hits and then do a date 
> query in the SQL database I have that contains the same data.  Another 
> alternative is to do the query without the date range in Lucene and 
> then sort the results within the range.  I still have to learn how to 
> use the new sorting code and confessed I did not have time to look at 
> it yet.
> 
> Is there a simpler, easier way to do this?
> 
I think it would be worth taking a look at the sorting code.

The idea of the sorting code is to have an array of the dates for each doc
in memory and access this array for sorting.
Now sorting isn't the only thing one might use this array for.
Doing a range check is another.
So you might extend the sorting code by a range selection.

There is no code for this in lucene and you have to create your own searcher
but it gives you a fast way to search and sort by date.

I did this independently of the new sorting code (I just started a little
too early) and it works quite well.
The only drawback of this (and the new sorting code) is that it requires
an array of field values that must be rebuilt each time the index changes.
Shouldn't be a problem for 60000 documents.
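
A minimal sketch of the array approach, assuming one int per document
holding its date as YYYYMMDD and a list of document numbers returned by
the rest of the query.  The class DateArrayRangeSketch and its methods
are hypothetical and stand apart from Lucene's own sorting code; the
sketch only shows that the same array serves both the range check and
the sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java model for illustration only; not Lucene's sorting code.
final class DateArrayRangeSketch {
    // dateByDoc[docId] holds the document's date as an int in YYYYMMDD form.
    // It has to be rebuilt whenever the index changes.
    private final int[] dateByDoc;

    DateArrayRangeSketch(int[] dateByDoc) {
        this.dateByDoc = dateByDoc.clone();
    }

    /** Range check is a single array lookup per candidate document. */
    boolean inRange(int docId, int from, int to) {
        int date = dateByDoc[docId];
        return date >= from && date <= to;
    }

    /** Keeps only the hits inside [from, to] and orders them by date, oldest first. */
    int[] filterAndSort(int[] hits, int from, int to) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : hits) {
            if (inRange(doc, from, to)) {
                kept.add(doc);
            }
        }
        kept.sort(Comparator.comparingInt((Integer doc) -> dateByDoc[doc]));
        int[] result = new int[kept.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = kept.get(i);
        }
        return result;
    }

    /** Made-up data: five documents, all of them hits of the text query. */
    static int[] demo() {
        int[] dates = {19781231, 19850615, 19790101, 20000101, 19900505};
        DateArrayRangeSketch index = new DateArrayRangeSketch(dates);
        return index.filterAndSort(new int[] {0, 1, 2, 3, 4}, 19790101, 19991231);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```

At four bytes per document the array stays small for 60000 documents;
the cost that grows with the collection is rebuilding it after each
index change.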

Morus


Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Matt Quail <ma...@ctx.com.au>.

 > Is there a simpler, easier way to do this?

Yes. I have started implementing a "QuickRangeQuery" class, that doesn't 
have the BooleanQuery limitation, but scores every matching document as 1.0.

I will see if I can get it finished in the next 24 hours, and post back 
to this thread.

=Matt

PS: I'm not sure about the "QuickRangeQuery" class name... maybe 
"NormalizedRangeQuery", "RangeQuery2"... *shrug*

Claude Devarenne wrote:
> Hi,
> 
> I have over 60,000 documents in my index which is slightly over a 1 GB 
> in size.  The documents range from the late seventies up to now.  I have 
> indexed dates as a keyword field using a string because the dates are in 
> YYYYMMDD format.  When I do range queries things are OK as long as I 
> don't exceed the built-in number of boolean clauses, so that's a range 
> of 3 years, e.g. 1979 to 1981.  The users are not only doing complex 
> queries but also want to query over long ranges, e.g. [19790101 TO 
> 19991231].
> 
> Given these requirements, I am thinking of doing a query without the 
> date range, bring the unique ids back from the hits and then do a date 
> query in the SQL database I have that contains the same data.  Another 
> alternative is to do the query without the date range in Lucene and then 
> sort the results within the range.  I still have to learn how to use the 
> new sorting code and confessed I did not have time to look at it yet.
> 
> Is there a simpler, easier way to do this?
> 
> Claude
> 
> 
> 
> 



How to handle range queries over large ranges and avoid Too Many Boolean clauses

Posted by Claude Devarenne <cl...@library.ucsf.edu>.

Hi,

I have over 60,000 documents in my index which is slightly over 1 GB 
in size.  The documents range from the late seventies up to now.  I 
have indexed dates as a keyword field using a string because the dates 
are in YYYYMMDD format.  When I do range queries things are OK as long 
as I don't exceed the built-in number of boolean clauses, so that's a 
range of 3 years, e.g. 1979 to 1981.  The users are not only doing 
complex queries but also want to query over long ranges, e.g. [19790101 
TO 19991231].

Given these requirements, I am thinking of doing a query without the 
date range, bring the unique ids back from the hits and then do a date 
query in the SQL database I have that contains the same data.  Another 
alternative is to do the query without the date range in Lucene and 
then sort the results within the range.  I still have to learn how to 
use the new sorting code and confess I have not had time to look at 
it yet.

Is there a simpler, easier way to do this?

Claude



Re: Page ranking

Posted by David Spencer <da...@tropo.com>.

Scott Sayles wrote:

>Is there anyone out there that has page ranking implemented on top of
>Lucene?
>  
>

I recently discovered JUNG which has 2 impls of PageRank:

http://jung.sourceforge.net/api/1.4.1/edu/uci/ics/jung/algorithms/importance/PageRank.html

I did a test of hooking it up to my spider and calculating pagerank of 
all pages in a javadoc tree (experimented with both 
http://jakarta.apache.org/lucene/docs/api/overview-summary.html and 
http://java.sun.com/j2se/1.4.2/docs/api/overview-summary.html). 

The basic procedure is:
[1] grab all pages to a local cache while building a table of page->page
links
[2] using the page->page link data, calculate pageranks with JUNG and
cache this
[3] go through the cache and index the pages (to a Lucene index), setting each
document's boost (Document.setBoost()) to the pagerank value
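
Step [2] can be sketched as follows.  This is a generic, textbook-style
power iteration written in plain Java, not the JUNG implementation and
not its API; the class name PageRankSketch, the damping factor of 0.85,
the fixed iteration count and the uniform handling of pages without
outgoing links are all assumptions made for the illustration.

```java
import java.util.Arrays;

// Textbook-style power iteration for illustration only; not the JUNG API.
final class PageRankSketch {
    /**
     * outLinks[p] lists the pages that page p links to (the page->page table).
     * Returns one rank per page; the ranks sum to 1.
     */
    static double[] pageRank(int[][] outLinks, double damping, int iterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0;
            for (int p = 0; p < n; p++) {
                if (outLinks[p].length == 0) {
                    // A page without links spreads its rank evenly over all pages.
                    dangling += rank[p];
                    continue;
                }
                double share = rank[p] / outLinks[p].length;
                for (int target : outLinks[p]) {
                    next[target] += share;
                }
            }
            for (int p = 0; p < n; p++) {
                next[p] = (1.0 - damping) / n + damping * (next[p] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Made-up link table: 0 -> 1,2   1 -> 2   2 -> 0   3 -> 2
        int[][] links = {{1, 2}, {2}, {0}, {2}};
        System.out.println(Arrays.toString(pageRank(links, 0.85, 50)));
    }
}
```

In this small graph page 2, which every other page links to, ends up
with the highest rank, and page 3, which nothing links to, with the
lowest.  Any change to the link table means rerunning the whole
computation, which is why it fits a batch step before indexing.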

I've just got this going over the weekend. Prelim results are 
disappointing.  Pages like 
http://java.sun.com/j2se/1.4.2/docs/api/deprecated-list.html get a high 
pagerank as all kinds of pages link to it, though when I search javadoc 
I never want that page. This might still turn out better, however: I'm 
not doing any query expansion now, though next pass I'll auto-boost for 
title matches.

I can make available a table of pageranks (URL,pagerank pairs) for these 
runs if people want.

>Just in case anyone may be thinking otherwise, when I say page ranking
>I'm not referring to the ranking of results from searches.  I'm talking
>about something similar to how Google computes which page may be more
>relevant or important (often referred to as PageRank), which is affected
>in part by how many other pages reference that page.
>
>I've been through the examples listed here:
>
>http://www.iprcom.com/papers/pagerank/index.html
>
>which provides information from the original Google paper about page
>ranking.  Running the examples is fairly easy, but the big question I
>have is how can I practically update such data?  
>

I think this is a batch operation, you have to precalc it when indexing 
the entire collection.

>And is there any
>potential integration with Lucene? 
>

My thoughts are Doc.setBoost or just a plain field and store it there 
and use it to sort the results.

> It would seem that one could store
>the computed ranking values in the actual Lucene Document itself, but
>the updates 
>

Unless something has changed, indexes are "write-only". You really can't 
update an index other than by deleting a doc and re-adding it, and to calc 
pagerank you need all links between pages.

>would be fairly laborious as a few minor changes in rankings
>can produce a large ripple in other related document rankings.  This, of
>course, would be the same issue if the ranking information were stored
>outside of Lucene.  One could potentially store this in a separate
>database and then look up the ranking information for each document
>found and then perform updates as an external asynchronous task.
>
>Anyone have any experience with maintaining page rankings?
>  
>

It might be of interest to see what Nutch does. It doesn't use pagerank 
but it does seem to care about the # of incoming links. I think the key 
file is IndexSegment ( see the src, not the jdoc).

>
>Thanks,
>
>Scott
>
>
>
>  
>
