Posted to java-user@lucene.apache.org by "Homam S.A." <ho...@yahoo.com> on 2004/12/15 03:42:38 UTC

Indexing a large number of DB records

I'm trying to index a large number of records from the
DB (a few million). Each record will be stored as a
document with about 30 fields, most of which are
UnStored and represent small strings or numbers. No
huge DB Text fields.

But I'm running out of memory very fast, and the
indexing slows to a crawl once I hit around
1500 records. The problem is that each document holds
references to the string objects returned from
ToString() on the DB fields, and the IndexWriter holds
references to all these document objects in
memory, so the garbage collector isn't getting a chance
to clean them up.

How do you guys go about indexing a large DB table?
Here's a snippet of my code (this method is called for
each record in the DB):

private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
	Document doc = new Document();
	for (int i = 0; i < BrowseFieldNames.Length; i++) {
		doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
	}
	iw.AddDocument(doc);
}




		


Re: Indexing a large number of DB records

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello Homam,

The batches I was referring to were batches of DB rows.
Instead of SELECT * FROM table ... in one go, do SELECT * FROM table ...
LIMIT Y OFFSET X (or your database's equivalent paging syntax) and index
one batch at a time.

Don't close the IndexWriter between batches - use a single instance throughout.

There is no MakeStable()-like method in Lucene, but you can control the
number of in-memory Documents, the frequency of segment merges, and the
maximal size of index segments with 3 IndexWriter parameters,
described fairly verbosely in the javadocs.
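
For example, roughly (Java API; the values here are picked out of the air,
so tune them to your data and RAM):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
    writer.minMergeDocs = 1000;              // Documents buffered in RAM before a segment is flushed
    writer.mergeFactor  = 50;                // segments that accumulate before being merged into one
    writer.maxMergeDocs = Integer.MAX_VALUE; // cap on the number of Documents in a single segment

In the 1.4 API these are public fields on IndexWriter.  Raising minMergeDocs
trades RAM for fewer flushes to disk; raising mergeFactor trades open files
and merge work for indexing speed.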

Since you are using the .NET version, you should really consult the
dotLucene guy(s).  Running under a profiler should also tell you
where the time and memory go.

Otis

--- "Homam S.A." <ho...@yahoo.com> wrote:

> Thanks Otis!
> 
> What do you mean by building it in batches? Does it
> mean I should close the IndexWriter every 1000 rows
> and reopen it? Does that releases references to the
> document objects so that they can be
> garbage-collected?
> 
> I'm calling optimize() only at the end.
> 
> I agree that 1500 documents is very small. I'm
> building the index on a PC with 512 megs, and the
> indexing process is quickly gobbling up around 400
> megs when I index around 1800 documents and the whole
> machine is grinding to a virtual halt. I'm using the
> latest DotLucene .NET port, so may be there's a memory
> leak in it.
> 
> I have experience with AltaVista search (acquired by
> FastSearch), and I used to call MakeStable() every
> 20,000 documents to flush memory structures to disk.
> There doesn't seem to be an equivalent in Lucene.
> 
> -- Homam
> 
> 
> 
> 
> 
> 
> --- Otis Gospodnetic <ot...@yahoo.com>
> wrote:
> 
> > Hello,
> > 
> > There are a few things you can do:
> > 
> > 1) Don't just pull all rows from the DB at once.  Do
> > that in batches.
> > 
> > 2) If you can get a Reader from your SqlDataReader,
> > consider this:
> >
>
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)
> > 
> > 3) Give the JVM more memory to play with by using
> > -Xms and -Xmx JVM
> > parameters
> > 
> > 4) See IndexWriter's minMergeDocs parameter.
> > 
> > 5) Are you calling optimize() at some point by any
> > chance?  Leave that
> > call for the end.
> > 
> > 1500 documents with 30 columns of short
> > String/number values is not a
> > lot.  You may be doing something else not Lucene
> > related that's slowing
> > things down.
> > 
> > Otis
> > 
> > 
> > --- "Homam S.A." <ho...@yahoo.com> wrote:
> > 
> > > I'm trying to index a large number of records from
> > the
> > > DB (a few millions). Each record will be stored as
> > a
> > > document with about 30 fields, most of them are
> > > UnStored and represent small strings or numbers.
> > No
> > > huge DB Text fields.
> > > 
> > > But I'm running out of memory very fast, and the
> > > indexing is slowing down to a crawl once I hit
> > around
> > > 1500 records. The problem is each document is
> > holding
> > > references to the string objects returned from
> > > ToString() on the DB field, and the IndexWriter is
> > > holding references to all these document objects
> > in
> > > memory, so the garbage collector is getting a
> > chance
> > > to clean these up.
> > > 
> > > How do you guys go about indexing a large DB
> > table?
> > > Here's a snippet of my code (this method is called
> > for
> > > each record in the DB):
> > > 
> > > private void IndexRow(SqlDataReader rdr,
> > IndexWriter
> > > iw) {
> > > 	Document doc = new Document();
> > > 	for (int i = 0; i < BrowseFieldNames.Length; i++)
> > {
> > > 		doc.Add(Field.UnStored(BrowseFieldNames[i],
> > > rdr.GetValue(i).ToString()));
> > > 	}
> > > 	iw.AddDocument(doc);
> > > }
> > > 
> > > 
> > > 
> > > 
> > > 		
> > > __________________________________ 
> > > Do you Yahoo!? 
> > > Yahoo! Mail - Find what you need with new enhanced
> > search.
> > > http://info.mail.yahoo.com/mail_250
> > > 
> > >
> >
> ---------------------------------------------------------------------
> > > To unsubscribe, e-mail:
> > lucene-user-unsubscribe@jakarta.apache.org
> > > For additional commands, e-mail:
> > lucene-user-help@jakarta.apache.org
> > > 
> > > 
> > 
> > 
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail:
> > lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail:
> > lucene-user-help@jakarta.apache.org
> > 
> > 
> 
> 
> 
> 		
> __________________________________ 
> Do you Yahoo!? 
> Take Yahoo! Mail with you! Get it on your mobile phone. 
> http://mobile.yahoo.com/maildemo 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Indexing a large number of DB records

Posted by "Homam S.A." <ho...@yahoo.com>.
Thanks Otis!

What do you mean by building it in batches? Does it
mean I should close the IndexWriter every 1000 rows
and reopen it? Does that release references to the
document objects so that they can be
garbage-collected?

I'm calling optimize() only at the end.

I agree that 1500 documents is very small. I'm
building the index on a PC with 512 megs, and the
indexing process quickly gobbles up around 400
megs by the time I index around 1800 documents, and the
whole machine grinds to a virtual halt. I'm using the
latest dotLucene .NET port, so maybe there's a memory
leak in it.

I have experience with AltaVista search (acquired by
FastSearch), and I used to call MakeStable() every
20,000 documents to flush memory structures to disk.
There doesn't seem to be an equivalent in Lucene.

-- Homam








Re: Indexing a large number of DB records

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

There are a few things you can do:

1) Don't just pull all rows from the DB at once.  Do that in batches.

2) If you can get a Reader from your SqlDataReader, consider this (there's a
short sketch after this list):
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)

3) Give the JVM more memory to play with by using -Xms and -Xmx JVM
parameters

4) See IndexWriter's minMergeDocs parameter.

5) Are you calling optimize() at some point by any chance?  Leave that
call for the end.
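
Regarding (2), here is a rough, untested sketch of what I mean, in Java; the
"id" and "contents" column/field names are placeholders for whatever your
schema actually uses:

    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Build one Document per row, streaming the big text column through a
    // Reader instead of materializing the whole value as a String.
    static Document toDocument(ResultSet rs) throws SQLException {
        Document doc = new Document();
        doc.add(Field.Keyword("id", rs.getString("id")));
        doc.add(Field.Text("contents", rs.getCharacterStream("contents")));
        return doc;
    }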

1500 documents with 30 columns of short String/number values is not a
lot.  You may be doing something else not Lucene related that's slowing
things down.

Otis


--- "Homam S.A." <ho...@yahoo.com> wrote:

> I'm trying to index a large number of records from the
> DB (a few millions). Each record will be stored as a
> document with about 30 fields, most of them are
> UnStored and represent small strings or numbers. No
> huge DB Text fields.
> 
> But I'm running out of memory very fast, and the
> indexing is slowing down to a crawl once I hit around
> 1500 records. The problem is each document is holding
> references to the string objects returned from
> ToString() on the DB field, and the IndexWriter is
> holding references to all these document objects in
> memory, so the garbage collector is getting a chance
> to clean these up.
> 
> How do you guys go about indexing a large DB table?
> Here's a snippet of my code (this method is called for
> each record in the DB):
> 
> private void IndexRow(SqlDataReader rdr, IndexWriter
> iw) {
> 	Document doc = new Document();
> 	for (int i = 0; i < BrowseFieldNames.Length; i++) {
> 		doc.Add(Field.UnStored(BrowseFieldNames[i],
> rdr.GetValue(i).ToString()));
> 	}
> 	iw.AddDocument(doc);
> }
> 
> 
> 
> 
> 		
> __________________________________ 
> Do you Yahoo!? 
> Yahoo! Mail - Find what you need with new enhanced search.
> http://info.mail.yahoo.com/mail_250
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 




RE: Indexing a large number of DB records

Posted by Garrett Heaver <ga...@researchandmarkets.com>.
There were other reasons for my choice of going with a temp index - namely, I
was having terrible write times to my live index because it was stored on a
different server; also, while I was writing to my live index, people were
trying to search on it and were getting "file not found" exceptions. So rather
than spend hours or days trying to fix that, I took the easiest route of
creating a temp index on the server that hosts the application and merging it
into the live index on the other server. This greatly increased my indexing
speed.

Best of luck
Garrett





RE: Indexing a large number of DB records

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Note that this approach includes some unnecessary extra steps.
You don't need a temp index.  Add everything to a single index using a
single IndexWriter instance.  There is no need to call addIndexes() nor
optimize() until the end.  Adding Documents to an index takes a constant
amount of time, regardless of the index size, because new segments are
created as documents are added, and existing segments don't need to be
updated (they are only touched when merges happen).  Again, I'd run your
app under a profiler to see where the time and memory are going.
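
The overall shape is just the following (a rough sketch against the Java API;
the JDBC URL, SQL, column names, and index path are placeholders):

    import java.sql.*;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class SimpleDbIndexer {
        public static void main(String[] args) throws Exception {
            // One IndexWriter for the whole run; true = create a new index.
            IndexWriter writer = new IndexWriter("/indexes/live", new StandardAnalyzer(), true);
            writer.minMergeDocs = 1000;  // buffer more Documents in RAM between segment flushes

            Connection con = DriverManager.getConnection("jdbc:...", "user", "password");
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery("SELECT id, title, body FROM articles");
            while (rs.next()) {
                Document doc = new Document();
                doc.add(Field.Keyword("id", rs.getString("id")));
                doc.add(Field.UnStored("title", rs.getString("title")));
                doc.add(Field.UnStored("body", rs.getString("body")));
                writer.addDocument(doc);  // constant-time per document; no temp index needed
            }
            rs.close();
            st.close();
            con.close();

            writer.optimize();  // once, at the very end
            writer.close();
        }
    }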

Otis





RE: Indexing a large number of DB records

Posted by Garrett Heaver <ga...@researchandmarkets.com>.
Hi Homam

I had a similar problem to yours, in that I was indexing a LOT of data.

Essentially, how I got round it was to batch the indexing.

What I was doing was to add 10,000 documents to a temporary index, use
addIndexes() to merge the temporary index into the live index (which also
optimizes the live index), then delete the temporary index. On the next loop
I'd only query rows from the DB whose ID is above the ID of the last document
in the live index (the one at maxDoc()), and set the max rows of the query to
10,000, i.e.

SELECT TOP 10000 [fields] FROM [tables] WHERE [id_field] > {ID from
Index.MaxDoc()} ORDER BY [id_field] ASC

Ensuring that the documents go into the index sequentially, your problem is
solved, and memory usage on mine (dotLucene 1.3) is low.
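
For what it's worth, the loop looks roughly like this when written against the
Java API (I'm on dotLucene, where the calls have the same shape); the paths,
table, and column names below are made up:

    import java.sql.*;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class BatchedDbIndexer {
        public static void main(String[] args) throws Exception {
            // Create the live index on the very first run.
            if (!IndexReader.indexExists("/indexes/live")) {
                new IndexWriter("/indexes/live", new StandardAnalyzer(), true).close();
            }
            Connection con = DriverManager.getConnection("jdbc:...", "user", "password");
            while (true) {
                // ID of the last row added to the live index (append-only, so the
                // highest-numbered document is the most recently added row).
                long lastId = 0;
                IndexReader reader = IndexReader.open("/indexes/live");
                if (reader.maxDoc() > 0) {
                    lastId = Long.parseLong(reader.document(reader.maxDoc() - 1).get("id"));
                }
                reader.close();

                // Pull the next 10,000 rows into a fresh temporary index.
                IndexWriter temp = new IndexWriter("/indexes/temp", new StandardAnalyzer(), true);
                Statement st = con.createStatement();
                ResultSet rs = st.executeQuery(
                    "SELECT TOP 10000 id, title, body FROM articles WHERE id > " + lastId +
                    " ORDER BY id ASC");
                int added = 0;
                while (rs.next()) {
                    Document doc = new Document();
                    doc.add(Field.Keyword("id", rs.getString("id")));  // stored, so it can be read back above
                    doc.add(Field.UnStored("title", rs.getString("title")));
                    doc.add(Field.UnStored("body", rs.getString("body")));
                    temp.addDocument(doc);
                    added++;
                }
                rs.close();
                st.close();
                temp.close();
                if (added == 0) break;  // no more rows to index

                // Merge the batch into the live index (this also optimizes it); the
                // temp index is simply recreated on the next pass.
                IndexWriter live = new IndexWriter("/indexes/live", new StandardAnalyzer(), false);
                live.addIndexes(new Directory[] { FSDirectory.getDirectory("/indexes/temp", false) });
                live.close();
            }
            con.close();
        }
    }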

Regards
Garrett



Re: A question about scoring function in Lucene

Posted by Vikas Gupta <vg...@cs.utexas.edu>.
Lucene uses the vector space model. To understand that:

-Read section 2.1 of "Space optimizations for Total Ranking" paper (Linked
here http://lucene.sourceforge.net/publications.html)
-Read section 6 to 6.4 of
http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
-Read section 1 of
http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps
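
For what it's worth, here is how I read that formula; treat this as a sketch
rather than an authoritative answer. On (1): norm_q does not depend on the
document, so it can be pulled out of the sum without changing the ranking:

    \[
    \sum_t \frac{tf_{t,q}\, idf_t}{norm_q} \cdot \frac{tf_{t,d}\, idf_t}{norm_{d,t}}
      \;=\; \frac{1}{norm_q} \sum_t \frac{(tf_{t,q}\, idf_t)\,(tf_{t,d}\, idf_t)}{norm_{d,t}}
    \]

so norm_q is effectively computed once per query, not once per term. On (2):
Lucene does not use the true document norm norm_d = sqrt(sum_t (tf_d*idf_t)^2);
norm_d_t is instead a per-field length norm computed at index time (in the
default Similarity it corresponds to the square root of the number of tokens
in that field). It approximates cosine normalization without needing the idf
of every term in the document, which is why there is no norm_d factor in (*).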

Vikas




A question about scoring function in Lucene

Posted by Nhan Nguyen Dang <nd...@yahoo.com>.
Hi all,
Lucene scores a document based on the correlation between
the query q and the document d:
(this is the raw function; I don't pay attention to the
boost_t and coord_q_d factors)

score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t
/ norm_d_t)  (*)

Could anybody explain it in detail? Or are there any
papers or documents about this function? Because:

I have also read the book Modern Information
Retrieval by Ricardo Baeza-Yates and Berthier
Ribeiro-Neto, Addison Wesley (hope you have read it
too). On page 27, they also suggest a scoring function
for the vector model based on the correlation between
query q and document d, as follows (I use different
symbols):

	         sum_t( weight_t_d * weight_t_q) 
score_d(d, q)=  --------------------------------- (**)
	     	      norm_d * norm_q 

where weight_t_d = tf_d * idf_t
      weight_t_q = tf_q * idf_t
      norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
      norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )

(**):          sum_t( tf_q*idf_t * tf_d*idf_t) 
score_d(d, q)=---------------------------------  (***)
		   norm_d * norm_q 

The two functions, (*) and (***), have 2 differences:
1. In (***), the sum_t applies only to the numerator, but
in (*), the sum_t applies to everything. So, with
norm_q = sqrt(sum_t((tf_q*idf_t)^2)), sum_t is
calculated twice. Is this right? Please explain.

2. There is no factor for the norm of the document (norm_d)
in function (*). Can you explain this? What is the
role of the factor norm_d_t?

One more question: could anybody give me documents or
papers that explain this function in detail, so that when I
apply Lucene to my system, I can adapt the documents
and the fields and still receive the correct
scoring information from Lucene?

Best regards,
Thanks everybody,

=====
Đặng Nhân





		