You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Maxym Mykhalchuk <ma...@dit.unitn.it> on 2006/04/10 20:46:16 UTC

Small field indexing and ranking

Hi All,

I've tried to search for the topic, but to no avail so far... Sorry if it's been raised before.

Here's the issue: All my "documents" will be having a few (2-3: title, short description) short fields. You see, it's rare that the same word is repeated several times in a title, so will Lucene be able to give me a decent ranking, or will it be able to tell me "oh, yes, this term is in the following 300 titles".

On what I've read on the topic so far, it seems that inverted indexes do work good on big texts, as they are able to exploit the repetition of words to do ranking.

Will it work for small fields?

Sincere,
Maxym
P.S. If you have links to good reading on inverted indexes, I'll appreciate that too.

==================================
Maxym Mykhalchuk
(+39) 320 8593170
PhD student at University of Trento, ITALY
==================================

Re: Small field indexing and ranking

Posted by Daniel Naber <lu...@danielnaber.de>.

On Dienstag 11 April 2006 10:33, Nadav Har'El wrote:

> This sort of proximity-influenced scoring is missing from
> Lucene's QueryParser, and I've been wondering recently
> on how it is best to add it, and whether it is possible to
> easily do it with existing Lucene machinary, like the
> SpanQuery class.

Nutch uses a boost someway like this:

Query: hot dog

Rewritten query: +hot +dog "hot dog"^1000

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Small field indexing and ranking

Posted by Nadav Har'El <NY...@il.ibm.com>.

"Maxym Mykhalchuk" <ma...@dit.unitn.it> wrote on 11/04/2006 11:52:07 AM:
> As for improving multi-word queries, Doug Cutting recently posted a link
to
> his presentation,
> http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf,

> just scroll down to Nutch N-Grams there, and you'll see the answer.
> Basically, "Buffy the Vampire" is indexed as buffy, buffy-the+0, the,
> the-vampire+0, vampire; but that's only for common terms like "the",
"www"
> etc. I wander if it's possible to use the same technique to store
phrases...

The method you propose has been known for a long time, and sometimes called
"Lexical Affinities".
For example, http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf
explains:

   "Lexical affinities (LAs) were first introduced by Saussure in 1947
    to represent the correlation between words co-occurring in a given
    language and then restricted to a given document for IR purposes.
    LAs are identified by looking at pairs of words found in close
    proximity to each other. It has been described elsewhere how LAs,
    when used as indexing units, improve precision of search by
    disambiguating terms."

A paper from SIGIR from as back as 1989
http://widit.slis.indiana.edu/irpub/SIGIR/1989/cite21.htm
makes use of this technique.

However, I'm not sure that this technique is still considered the
best one. It can have a large impact on the index size, and it may
be possible to get similar results with no impact on index size
and just a small run-time slowdown by using something like
SpanNearQuery, or a variation on this idea. Again, I didn't yet
try to do this myself, so I'm not sure how successful that would
be.

--
Nadav Har'El


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Small field indexing and ranking

Posted by Maxym Mykhalchuk <ma...@dit.unitn.it>.

Hi Nadav,

Thanks for suggestions.

As for improving multi-word queries, Doug Cutting recently posted a link to 
his presentation, 
http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf, 
just scroll down to Nutch N-Grams there, and you'll see the answer.
Basically, "Buffy the Vampire" is indexed as buffy, buffy-the+0, the, 
the-vampire+0, vampire; but that's only for common terms like "the", "www" 
etc. I wander if it's possible to use the same technique to store phrases...

Maxym

==================================
Maxym Mykhalchuk
(+39) 320 8593170
PhD student at University of Trento, ITALY
==================================
----- Original Message ----- 
From: "Nadav Har'El" <NY...@il.ibm.com>
To: <ja...@lucene.apache.org>
Sent: Tuesday, April 11, 2006 10:33 AM
Subject: Re: Small field indexing and ranking


> "Maxym Mykhalchuk" <ma...@dit.unitn.it> wrote on 10/04/2006 09:46:16 PM:
>> Here's the issue: All my "documents" will be having a few (2-3:
>> title, short description) short fields. You see, it's rare that the
>> same word is repeated several times in a title, so will Lucene be
>> able to give me a decent ranking, or will it be able to tell me "oh,
>> yes, this term is in the following 300 titles".
>>
>> On what I've read on the topic so far, it seems that inverted
>> indexes do work good on big texts, as they are able to exploit the
>> repetition of words to do ranking.
>
> Lucene is no psychic. If you're looking for "dog", and the document
> contains two short documents, actually titles:
>      "Sparky the Fire Dog"
> and   "Dog Hause Home Page"
> (just two silly titles from Google's top 10 results for "dog"...)
> Then there's hardly any way for Lucene to determine which document
> should be ranked higher.
>
> For single word queries in a situation like this, you might want
> to help Lucene learn the "good" ranking. One way is to use
> Document.setBoost() (or Field.setBoost) to pre-determine which
> document is more "important" regardless of its text (e.g.,
> using some sort of link analysis, or whatever trick that is
> applicable in your situation). Another way is to override
> Lucene's relevance ranking with some other type of sorting
> (see the Sort class) - for example, to sort all the matching
> results by date, to get the newer matching results first.
> In many applications, you might want to let your users control
> this sort order; For example, in a shopping site (where product
> names are the very short "documents"), you might want to let
> the user sort the results by price, by popularity, by release
> date, by users' ranking, and so on.
>
> For multi-word queries, it is actually possible to improve
> on Lucene's standard ranking. For example, let's say you
> have the two titles
>      "Hot Dog on a Stick"
>      "Your Dog in Hot Weather"
> And get a query "hot dog" (without quotation marks).
> Using QueryParser, Lucene will normally rank the two titles
> more or less the same. However, the first one is probably
> much better because the words "hot" and "dog", don't just
> appear there, they actually appear very close, and in this
> case even in order.
>
> This sort of proximity-influenced scoring is missing from
> Lucene's QueryParser, and I've been wondering recently
> on how it is best to add it, and whether it is possible to
> easily do it with existing Lucene machinary, like the
> SpanQuery class. Has anyone ever tried to do something
> like this before, and can tell us their experience?
>
> Good Luck,
> Nadav.
>
> --
> Nadav Har'El
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Small field indexing and ranking

Posted by Nadav Har'El <NY...@il.ibm.com>.

"Maxym Mykhalchuk" <ma...@dit.unitn.it> wrote on 10/04/2006 09:46:16 PM:
> Here's the issue: All my "documents" will be having a few (2-3:
> title, short description) short fields. You see, it's rare that the
> same word is repeated several times in a title, so will Lucene be
> able to give me a decent ranking, or will it be able to tell me "oh,
> yes, this term is in the following 300 titles".
>
> On what I've read on the topic so far, it seems that inverted
> indexes do work good on big texts, as they are able to exploit the
> repetition of words to do ranking.

Lucene is no psychic. If you're looking for "dog", and the document
contains two short documents, actually titles:
      "Sparky the Fire Dog"
and   "Dog Hause Home Page"
(just two silly titles from Google's top 10 results for "dog"...)
Then there's hardly any way for Lucene to determine which document
should be ranked higher.

For single word queries in a situation like this, you might want
to help Lucene learn the "good" ranking. One way is to use
Document.setBoost() (or Field.setBoost) to pre-determine which
document is more "important" regardless of its text (e.g.,
using some sort of link analysis, or whatever trick that is
applicable in your situation). Another way is to override
Lucene's relevance ranking with some other type of sorting
(see the Sort class) - for example, to sort all the matching
results by date, to get the newer matching results first.
In many applications, you might want to let your users control
this sort order; For example, in a shopping site (where product
names are the very short "documents"), you might want to let
the user sort the results by price, by popularity, by release
date, by users' ranking, and so on.

For multi-word queries, it is actually possible to improve
on Lucene's standard ranking. For example, let's say you
have the two titles
      "Hot Dog on a Stick"
      "Your Dog in Hot Weather"
And get a query "hot dog" (without quotation marks).
Using QueryParser, Lucene will normally rank the two titles
more or less the same. However, the first one is probably
much better because the words "hot" and "dog", don't just
appear there, they actually appear very close, and in this
case even in order.

This sort of proximity-influenced scoring is missing from
Lucene's QueryParser, and I've been wondering recently
on how it is best to add it, and whether it is possible to
easily do it with existing Lucene machinary, like the
SpanQuery class. Has anyone ever tried to do something
like this before, and can tell us their experience?

Good Luck,
Nadav.

--
Nadav Har'El


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org