Posted to java-user@lucene.apache.org by petite_abeille <pe...@mac.com> on 2002/08/03 00:44:52 UTC

text format and scoring

Hello,

I was wondering what would be a good way to incorporate text format 
information in Lucene word/document scoring. For example, when turning 
HTML into plain text for indexing purposes, a lot of potentially useful 
information is lost: e.g. tags like <bold>, <strong> and so on could be 
understood as conveying emphasis information about some words. If 
somebody took the pain to "underline" some words, why throw it away? 
Assuming there is some interesting meaning in a document format/layout, 
and a way to understand it and weight it, how could one incorporate this 
information into document scoring?

Thanks for any insights :-)

PA.


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

Re: text format and scoring

Posted by Joshua O'Madadhain <jm...@ics.uci.edu>.

On Sat, 3 Aug 2002, petite_abeille wrote:

> I was wondering what would be a good way to incorporate text format 
> information in Lucene word/document scoring. For example, when turning 
> HTML into plain text for indexing purposes, a lot of potentially useful 
> information is lost: e.g. tags like <bold>, <strong> and so on could be 
> understood as conveying emphasis information about some words. If 
> somebody took the pain to "underline" some words, why throw it away? 
> Assuming there is some interesting meaning in a document format/layout, 
> and a way to understand it and weight it, how could one incorporate this 
> information into document scoring?

If you can boost terms as they are indexed (I can't remember if this is
possible, but you can certainly do so on queries) then that might be a
good way of doing it; it's not so much a matter of changing document
scores (on the back end, with respect to a particular query) as it is of
changing the weighting of terms (on the front end).

I've just glanced through the API and I don't see a way to do term
boosting during indexing, but maybe there's something I've missed.  
Anyone?

Regards,

Joshua O'Madadhain

 jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.


Re: text format and scoring

Posted by petite_abeille <pe...@mac.com>.

Hi Alex,

On Saturday, August 3, 2002, at 11:13 , Alex Murzaku wrote:

> Hi PA! How are things going?

Doing all right :-)

>
> It's an interesting question but I don't think Lucene
> (as it is today) could change weights based on
> semantics (either assigned by formatting tags or maybe
> looked up in some dictionary like WordNet)...

Ummm... I see.

>
> Some time ago, Doug sent to this list the formula for
> the score computation which is:

Thanks.

> The only thing that counts is the frequency of the
> terms in the document and among documents.
>
> A way to influence the final score might be to tweak
> the real frequencies during indexing with some
> parameters configured externally. Let's say if the
> word is underlined then multiply its count by X. This
> modified TF should influence the final score
> accordingly.
>
> Just a thought...

I see. That's basically what I'm doing right now: I index a 
document multiple times (e.g. an email could be indexed by subject, first 
sentence and body content). Then I do multiple searches and use a 
"ranking comparator" to evaluate the results based on how many times I get 
a specific document, plus its Lucene scores and other funky heuristics. 
This seems to work OK, but is kind of cumbersome :-( Same deal for 
finding "related" documents. Lucene is very good at finding "similar" 
documents, but for "related" (think "cluster" ;-), I basically end up 
doing some term categorization and assigning a multiplying factor to 
each term category, which I then feed to Lucene to get something more 
akin to a "cluster" of documents...
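
[Editor's sketch] The multi-search approach described above might look
roughly like the following. This is only an illustration in plain Java, not
PA's actual code and not a Lucene API: the class name RankMerge, the
per-field weights, and the tie-breaking order (number of fields matched
first, then weighted score sum) are all assumptions. The per-field hit maps
stand in for whatever the individual Lucene searches returned.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical "ranking comparator": merges per-field search results. */
final class RankMerge {

    /**
     * @param hitsByField field name -> (document id -> score from that search)
     * @param fieldWeight field name -> multiplier (assumed; defaults to 1.0)
     * @return document ids, best first
     */
    static List<String> rank(Map<String, Map<String, Double>> hitsByField,
                             Map<String, Double> fieldWeight) {
        Map<String, Integer> hitCount = new HashMap<>();
        Map<String, Double> combined = new HashMap<>();

        for (Map.Entry<String, Map<String, Double>> field : hitsByField.entrySet()) {
            double weight = fieldWeight.getOrDefault(field.getKey(), 1.0);
            for (Map.Entry<String, Double> hit : field.getValue().entrySet()) {
                // How many separate searches returned this document.
                hitCount.merge(hit.getKey(), 1, Integer::sum);
                // Weighted sum of the scores it received.
                combined.merge(hit.getKey(), weight * hit.getValue(), Double::sum);
            }
        }

        List<String> ranked = new ArrayList<>(hitCount.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(hitCount.get(b), hitCount.get(a));
            if (byCount != 0) {
                return byCount;
            }
            int byScore = Double.compare(combined.get(b), combined.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> hits = Map.of(
                "subject", Map.of("m1", 0.9, "m2", 0.4),
                "body", Map.of("m2", 0.8, "m3", 0.95));
        Map<String, Double> weights = Map.of("subject", 2.0, "body", 1.0);
        System.out.println(rank(hits, weights));
    }
}
```

Ranking by match count before score is just one possible heuristic; a
single weighted sum would work as well and is easier to tune.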

In any case, I was simply wondering if there was a more straightforward 
way of doing things.

Cheers,

PA.


Re: text format and scoring

Posted by Alex Murzaku <mu...@yahoo.com>.

Hi PA! How are things going?

It's an interesting question but I don't think Lucene
(as it is today) could change weights based on
semantics (either assigned by formatting tags or maybe
looked up in some dictionary like WordNet)...

Some time ago, Doug sent to this list the formula for
the score computation which is:

  score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t)
            * coord_q_d

  score_d   : score for document d
  sum_t     : sum for all terms t
  tf_q      : the square root of the frequency of t in the query
  tf_d      : the square root of the frequency of t in d
  idf_t     : log(numDocs / (docFreq_t + 1)) + 1.0
  numDocs   : number of documents in index
  docFreq_t : number of documents containing t
  norm_q    : sqrt(sum_t((tf_q * idf_t)^2))
  norm_d_t  : square root of number of tokens in d in the same field as t
  boost_t   : the user-specified boost for term t
  coord_q_d : number of terms in both query and document / number of terms
              in query
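
[Editor's sketch] To make the formula above concrete, here is a small
stand-alone Java transcription. It is not Lucene's code, just the arithmetic
as quoted. Assumptions: the class and parameter names are made up, idf is
read as log(numDocs / (docFreq_t + 1)) + 1.0, all terms live in a single
field (so norm_d_t is the same for every term), and a missing boost counts
as 1.0.

```java
import java.util.Map;

/** Illustrative transcription of the quoted scoring formula. */
final class ScoreSketch {

    /** idf_t = log(numDocs / (docFreq_t + 1)) + 1.0 */
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /**
     * @param queryFreq   term -> frequency of the term in the query
     * @param docTermFreq term -> frequency of the term in document d
     * @param docFreq     term -> number of documents containing the term
     * @param boost       term -> user-specified boost (1.0 if absent)
     * @param numDocs     number of documents in the index
     * @param fieldLength number of tokens in d's field (single field assumed)
     */
    static double score(Map<String, Integer> queryFreq,
                        Map<String, Integer> docTermFreq,
                        Map<String, Integer> docFreq,
                        Map<String, Double> boost,
                        int numDocs,
                        int fieldLength) {
        // norm_q = sqrt(sum_t((tf_q * idf_t)^2))
        double sumOfSquares = 0.0;
        for (Map.Entry<String, Integer> e : queryFreq.entrySet()) {
            double w = Math.sqrt(e.getValue())
                    * idf(numDocs, docFreq.getOrDefault(e.getKey(), 0));
            sumOfSquares += w * w;
        }
        double normQ = Math.sqrt(sumOfSquares);
        double normD = Math.sqrt(fieldLength); // norm_d_t

        double sum = 0.0;
        int matched = 0;
        for (Map.Entry<String, Integer> e : queryFreq.entrySet()) {
            String term = e.getKey();
            int freqInDoc = docTermFreq.getOrDefault(term, 0);
            if (freqInDoc == 0) {
                continue; // term not in d: contributes nothing
            }
            matched++;
            double idfT = idf(numDocs, docFreq.getOrDefault(term, 0));
            double tfQ = Math.sqrt(e.getValue());
            double tfD = Math.sqrt(freqInDoc);
            sum += (tfQ * idfT / normQ) * (tfD * idfT / normD)
                    * boost.getOrDefault(term, 1.0);
        }
        // coord_q_d = terms in both query and document / terms in query
        double coord = queryFreq.isEmpty() ? 0.0 : (double) matched / queryFreq.size();
        return sum * coord;
    }

    public static void main(String[] args) {
        double s = score(Map.of("lucene", 1), Map.of("lucene", 4),
                Map.of("lucene", 9), Map.of(), 10, 16);
        System.out.println(s);
    }
}
```

Note that tf_d enters as a square root, so quadrupling a term's count in a
document only doubles that term's contribution (before length normalization).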

The only thing that counts is the frequency of the
terms in the document and among documents. 

A way to influence the final score might be to tweak
the real frequencies during indexing with some
parameters configured externally. Let's say if the
word is underlined then multiply its count by X. This
modified TF should influence the final score
accordingly.
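
[Editor's sketch] The idea in the paragraph above (multiply the count of an
emphasized word by X) could be prototyped outside Lucene along these lines.
Everything here is an assumption for illustration: the class name
EmphasisTf, the set of tags treated as emphasis, and the deliberately naive
regex-based scan, which does not handle entities, comments or malformed
markup. It only produces the adjusted counts; how to feed them to the
indexer is left open.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical pre-indexing step: term counts weighted by HTML emphasis. */
final class EmphasisTf {

    // Tags assumed to convey emphasis; adjust to taste.
    private static final Set<String> EMPHASIS_TAGS = Set.of("b", "strong", "u", "em");

    // Either a tag (group 1: "/" if closing, group 2: tag name) or a word (group 3).
    private static final Pattern TOKEN = Pattern.compile(
            "<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([\\p{L}\\p{N}]+)");

    /** Counts each word once, or emphasisFactor times if inside an emphasis tag. */
    static Map<String, Integer> weightedCounts(String html, int emphasisFactor) {
        Map<String, Integer> counts = new HashMap<>();
        int depth = 0; // number of currently open emphasis tags
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(3) != null) {
                String term = m.group(3).toLowerCase(Locale.ROOT);
                counts.merge(term, depth > 0 ? emphasisFactor : 1, Integer::sum);
            } else if (EMPHASIS_TAGS.contains(m.group(2).toLowerCase(Locale.ROOT))) {
                depth = m.group(1).isEmpty() ? depth + 1 : Math.max(0, depth - 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String html = "Lucene is <b>fast</b> and <strong>fast search</strong> matters";
        Map<String, Integer> counts = weightedCounts(html, 3);
        System.out.println("fast=" + counts.get("fast") + " search=" + counts.get("search"));
    }
}
```

Since the formula uses the square root of the term frequency, a factor of X
on the count translates into a factor of sqrt(X) on that term's score
contribution, so X may need to be larger than intuition suggests.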

Just a thought...

Alex
