Posted to java-user@lucene.apache.org by "Taher H. Haveliwala" <ta...@yahoo.com> on 2002/10/13 04:18:34 UTC
indexing documents that arrive in pieces
What is the cleanest way in Lucene to add documents to
an index, if the entire document is not readily
available at one time?
E.g., I want to index the text as well as the
anchor-text of a stream of html pages, where the
anchor-text terms get associated with the page _being
pointed to_. For a document d_i, I don't know all the
terms that should be added to its "anchor" field,
until I've seen all documents d_j that link to d_i.
Of course I can make a pass over the web pages, and
gather up the relevant terms myself, but if Lucene has
the necessary machinery to add portions of a document
at different times, it would save me work.
Thanks
Taher
--
To unsubscribe, e-mail: <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>
Re: Alphabetical sorting of results
Posted by David Birtwell <Da...@vwr.com>.
Eoin,
There is a technique for predefining an ordering of results at index
time that might be applicable here. It involves making slight
modifications to the Lucene source. Here's a summary from another mail
I had written on the subject:
---
I was faced with a similar problem. We wanted to have a numeric rank
field in each document influence the order in which the documents were
returned by lucene. While investigating a solution for this, I wanted
to see if I could implement strict sorting based on this numeric value.
I was able to accomplish this using document boosting, but not without
modifying the lucene source. Our "ranking" field is an integer value
from one to one hundred. I'm not sure if this will help you, but I'll
include a summary of what I did.
In DocumentWriter remove the normalization by field length:
    float norm = fieldBoosts[n] *
                 Similarity.normalizeLength(fieldLengths[n]);
to
    float norm = fieldBoosts[n];
In TermScorer and PhraseScorer, modify the score() method to ignore the
lucene base score:
    score *= Similarity.decodeNorm(norms[d]);
to
    score = Similarity.decodeNorm(norms[d]);
In Similarity.java, make byteToFloat() public.
At index time, use Similarity.byteToFloat() to determine your boost
value as in the following pseudocode:
    Document d = new Document();
    // ... add your fields ...
    // rank must fit in one byte (0 to 255); d.get() returns the stored string value
    int rank = Integer.parseInt(d.get("RANK"));
    float sortVal = Similarity.byteToFloat((byte) rank);
    d.setBoost(sortVal);
---
In your situation, perhaps you could define a rank based on the
alphabetic ordering value of your title field. With only 256 discrete
boost values currently available to you, though, you'll probably have to
group your titles alphabetically into buckets.
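
A minimal sketch of one way to do that bucketing, in plain Java with no
Lucene calls. The class name TitleBuckets, the method names, and the
scheme itself (bucket on the first two letters, with non-letters sorting
first) are assumptions made for illustration, not something from the
Lucene source. It also assumes that a larger rank should sort earlier,
so titles near "A" get the largest values.

```java
// Hypothetical helper: maps a title to one of 256 alphabetical buckets.
class TitleBuckets {
    private static final int BUCKETS = 256;
    private static final int SYMBOLS = 27; // 0 = not a letter, 1..26 = a..z

    // Returns 0 for a missing or non a-z character, otherwise 1..26.
    private static int letterIndex(String s, int pos) {
        if (pos >= s.length()) {
            return 0;
        }
        char c = Character.toLowerCase(s.charAt(pos));
        return (c >= 'a' && c <= 'z') ? (c - 'a' + 1) : 0;
    }

    // Bucket in 0..255; alphabetically earlier titles get smaller buckets.
    static int bucketFor(String title) {
        String t = title.trim();
        int position = letterIndex(t, 0) * SYMBOLS + letterIndex(t, 1);
        return position * BUCKETS / (SYMBOLS * SYMBOLS);
    }

    // Rank in 0..255; alphabetically earlier titles get larger ranks,
    // on the assumption that a larger boost is returned first.
    static int rankFor(String title) {
        return (BUCKETS - 1) - bucketFor(title);
    }
}
```

The value from rankFor(title) would take the place of the RANK field in
the pseudocode above. Titles that share their first two letters land in
the same bucket, so their relative order is not determined by this
scheme.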
You also might want to investigate modifying the lucene source to return
the same score for each hit, then index your files in alphabetical
order. I *believe* that, independent of score, lucene will return the
results in the order in which they were indexed.
DaveB
Eoin O'Toole wrote:
> I am indexing documents (about 7 different document types) and must
> display the results alphabetically by title field... which is
> generally not one of the search fields.
>
> Currently I am calling hits.get(i) on each document to find the title,
> and then sorting by title. Sort is fast, but calling hits.get(i) n
> times is too slow beyond about 400 objects... and this approach means
> I have to do a "full scan" of the Hits collection.
>
> Anyone have any suggestions/strategies on solving this? (Or is there
> functionality already in place I have overlooked?)
>
> Thanks for any input,
>
> Eoin
Re: Alphabetical sorting of results
Posted by Peter Carlson <ca...@bookandhammer.com>.
Hi Eoin,
In the contributions area, there is a project called SearchBean which
will handle most of the sorting issues for you. It does a full scan at
startup (for 100K docs, about 5-10 seconds) and stores the field to be
sorted in an array. It can then access the sorted field value
much faster than hits.get(i).
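
A rough sketch of the caching idea described above, in plain Java. This
is not SearchBean's actual API; the class SortFieldCache and its methods
are hypothetical, and the sketch assumes the title of every document has
already been read once into an array indexed by document number.

```java
import java.util.Arrays;

// Hypothetical cache: one title per document number, loaded once at startup.
class SortFieldCache {
    private final String[] titleByDocId;

    SortFieldCache(String[] titleByDocId) {
        this.titleByDocId = titleByDocId.clone();
    }

    // Returns the given document numbers ordered by cached title,
    // ignoring case, without fetching any stored document.
    Integer[] sortByTitle(int[] hitDocIds) {
        Integer[] sorted = new Integer[hitDocIds.length];
        for (int i = 0; i < hitDocIds.length; i++) {
            sorted[i] = hitDocIds[i];
        }
        Arrays.sort(sorted,
                (a, b) -> titleByDocId[a].compareToIgnoreCase(titleByDocId[b]));
        return sorted;
    }
}
```

With this arrangement, each search only needs the document numbers of
its hits; the per-hit cost is an array lookup rather than a document
fetch.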
I hope this helps
--Peter
On Thursday, October 31, 2002, at 05:54 AM, Eoin O'Toole wrote:
> I am indexing documents (about 7 different document types) and must
> display the results alphabetically by title field... which is
> generally not one of the search fields.
>
> Currently I am calling hits.get(i) on each document to find the title,
> and then sorting by title. Sort is fast, but calling hits.get(i) n
> times is too slow beyond about 400 objects... and this approach means
> I have to do a "full scan" of the Hits collection.
>
> Anyone have any suggestions/strategies on solving this? (Or is there
> functionality already in place I have overlooked?)
>
> Thanks for any input,
>
> Eoin
Alphabetical sorting of results
Posted by Eoin O'Toole <eo...@obs.com>.
I am indexing documents (about 7 different document types) and must display
the results alphabetically by title field... which is generally not one of
the search fields.
Currently I am calling hits.get(i) on each document to find the title, and
then sorting by title. Sort is fast, but calling hits.get(i) n times is
too slow beyond about 400 objects... and this approach means I have to do a
"full scan" of the Hits collection.
Anyone have any suggestions/strategies on solving this? (Or is there
functionality already in place I have overlooked?)
Thanks for any input,
Eoin
Re: indexing documents that arrive in pieces
Posted by Ype Kingma <yk...@xs4all.nl>.
On Sunday 13 October 2002 04:18, you wrote:
> What is the cleanest way in Lucene to add documents to
> an index, if the entire document is not readily
> available at one time?
>
> E.g., I want to index the text as well as the
> anchor-text of a stream of html pages, where the
> anchor-text terms get associated with the page _being
> pointed to_. For a document d_i, I don't know all the
> terms that should be added to its "anchor" field,
> until I've seen all documents d_j that link to d_i.
Mr. Codd would normalize the anchor texts as an attribute
of the link from the pointing page to the pointed page.
So a clean way is to store these links in a separate (lucene) db,
putting the anchor text in a stored field so it can be retrieved when
needed.
Since Lucene is a free format database, you can store these links
in any Lucene db. Which place is best depends on the circumstances
(i.e. which operation is most time critical).
This also depends on how you want to search your anchor fields: should
they be in the same lucene document as the pointed to page, or could
you just allow searching for anchors in a separate 'links' db?
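
To make the link-per-record idea concrete, here is a small sketch in
plain Java with no Lucene calls. The Link class, its field names, and
anchorTextByTarget are hypothetical names chosen for illustration. Each
link is kept as its own record while pages stream in, and the anchor
text for a pointed-to page is gathered from those records only when it
is needed.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AnchorLinks {
    // One record per hyperlink: pointing page, pointed-to page, anchor text.
    static final class Link {
        final String fromUrl;
        final String toUrl;
        final String anchorText;

        Link(String fromUrl, String toUrl, String anchorText) {
            this.fromUrl = fromUrl;
            this.toUrl = toUrl;
            this.anchorText = anchorText;
        }
    }

    // Groups anchor texts by the page being pointed to, joined with spaces,
    // in the order the links were seen.
    static Map<String, String> anchorTextByTarget(List<Link> links) {
        Map<String, StringBuilder> collected = new LinkedHashMap<>();
        for (Link link : links) {
            StringBuilder sb =
                    collected.computeIfAbsent(link.toUrl, k -> new StringBuilder());
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(link.anchorText);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : collected.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }
}
```

If the anchors need to be searchable in the same document as the page,
the gathered string for each target could become that page's "anchor"
field once all links have been seen; otherwise each Link could be
indexed as its own document in a separate links db.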
Have fun,
Ype