You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Markus Fischer <ma...@fischer.name> on 2006/02/26 11:20:44 UTC

Different fields in the same and index and query boosting

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

1) Many different fields because of different projects?

I'm accessing Lucenes index on a remote server through XML-RPC and I'ld
like to use one index for completely independent search projects. The
number of query requests is not that high (just a few per minute) so
from performance wise I wouldn't see a problem.

When adding documents to an index in this case I would add e.g. 5000
documents with fields like "name", "author", "excerpt" and then also a
batch of 2000 files with other fields like "title", "front", "back".

Before adding a batch of thousand files for one project I would first
remove them. Common for all projects would be one "key" field by which I
would differentiate the projects, the query terms are so crafted that
key is always set as an AND field. They key has values like "project1"
and "project2".

Now my concern is: the more projects I add the more different fields
would come into play. I would not recreate the index from scratch as I'm
doing right now but I would only remove all documents with e.g. key
"project1" and add the new documents completely but not touching other
projects.

- From what I read from the documentation, as long as I would call
optimized() after every batch of change it should be ok.

2) Query boosting
Currently I was using query boosting extensive for the headings in HTML
documents, e.g. title:(term)^8 h1:(term)^7 ... h6:(term)^2
content:(term)^1 . I was wondering if this is actually necessary. The
number of existing h1 to h6 fields with content decreases with the
amount of documents. To give the fields title and h1, which are the most
used ones anyway, the highest importance, to I need the boost factor
here anyway or can I avoid them?

thanks for any answers,
- - Markus
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.1 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFEAYD71nS0RcInK9ARAlA4AJ9hhocglnke6fGthZpjeKeoZA6dfACeJo/6
i8n5ZhbuY/ujzyT8vRY7vpg=
=GWZ8
-----END PGP SIGNATURE-----

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Different fields in the same and index and query boosting

Posted by Chris Hostetter <ho...@fucit.org>.

: Now my concern is: the more projects I add the more different fields
: would come into play. I would not recreate the index from scratch as I'm
: doing right now but I would only remove all documents with e.g. key
: "project1" and add the new documents completely but not touching other
: projects.

this should work ok, but one of the things you'll want to watch out for is
the fieldNorms.

Anytime you add a document with an index field, lucene by default creates
a "fieldNorm" value for *every* document in your index -- even if those
documents don't have that field.

if you are only planning on having a handfull of document types, and only
a handfull of indexed fields per doctype, then this shouldn't be a big
deal -- but if you wnat to have thousands of indexed fields per doctype,
or thousands of doctypes -- you may find your index size growing a lot
faster then you expected.

there is an option in 1.9 that lets you specify when adding an indexed
field to a document that you don't want to bother storing a fieldNorm for
that field -- if you do this length normalization won't be possible for
queries on that field, and index time field/document boosts won't work --
but if you aren't concerned about those things, it will help keep your
index size managable.

: Currently I was using query boosting extensive for the headings in HTML
: documents, e.g. title:(term)^8 h1:(term)^7 ... h6:(term)^2
: content:(term)^1 . I was wondering if this is actually necessary. The
: number of existing h1 to h6 fields with content decreases with the
: amount of documents. To give the fields title and h1, which are the most
: used ones anyway, the highest importance, to I need the boost factor
: here anyway or can I avoid them?

you should try some queries like "title:term content:term" and look at the
explain output on your matches to see how much of an impact on the final
score the various matches on title vs content have ... if there are a lot
less terms indexed in the title field then in the content field you should
see the match on title be more significant, and then you can decide how
much boost you want to give if it's not significant enough.

my question do you would be: why do you wnat to avoid using query
time boosts?  there's really no harm in using them, under the coveres an
implicit boost of 1.0f is used for every Query class (that i can think of)
so specifying your own boost value doesn't really affect the performance
of the query if that's what you are concerned about.


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org