Posted to java-user@lucene.apache.org by charlie w <sp...@gmail.com> on 2007/05/18 22:01:43 UTC

documents with large numbers of fields

Hi all,

I am trying to create an index where I can apply specific boost values to
documents, based on something like multiple tags per document.  Each tag
would wind up having a different boost value in each document.

My original approach was to add many fields with the same name, and the
value of each field would be the tag.  I applied a different boost to each
tag.  So something like:
tag=foo      ^2
tag=bar      ^1.2
tag=foobar   ^1.8
searching "tag:bar", for example.  Different documents might have different
boost values for "tag=bar", influencing the scoring.  A document might have
hundreds of these tags.  I might be searching for hundreds of different
values in these tag fields.
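
For concreteness, here's roughly what my indexing code does (Lucene 2.x
API; the store/index flags are just what I happen to use):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    // one "tag" field per tag, each carrying its own boost
    String[] tags  = { "foo", "bar", "foobar" };
    float[] boosts = { 2.0f, 1.2f, 1.8f };
    for (int i = 0; i < tags.length; i++) {
      Field f = new Field("tag", tags[i],
                          Field.Store.NO, Field.Index.UN_TOKENIZED);
      f.setBoost(boosts[i]);
      doc.add(f);
    }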

I have discovered this doesn't work, seemingly because of the way
DefaultSimilarity creates the fieldNorm.  Internally, Lucene effectively
combines these into a single "tag" field with 3 terms and a boost value
that is the product of the 3 individual boosts.  If I did indeed have a
document with many, many of these tag fields, I'd wind up with a
freakishly huge boost on that document.
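
To put numbers on it (this is my understanding of DefaultSimilarity, so
treat the details as approximate): the three boosts above get multiplied
together into the single norm for the "tag" field:

    boost product = 2.0 * 1.2 * 1.8 = 4.32
    lengthNorm    = 1 / sqrt(3 terms) ~= 0.577
    stored norm   ~= 4.32 * 0.577 ~= 2.49  (quantized to one byte)

With hundreds of tags per document, that product blows up.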

So now I have the idea to invert the field name and value, like this:
foo=tag      ^2
bar=tag      ^1.2
foobar=tag   ^1.8
and search "foo:tag".
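
In code, the only change from the snippet above would be swapping the
field name and value (again, just illustrative):

    // one field per tag; the field *name* is the tag, the value is constant
    for (int i = 0; i < tags.length; i++) {
      Field f = new Field(tags[i], "tag",
                          Field.Store.NO, Field.Index.UN_TOKENIZED);
      f.setBoost(boosts[i]);
      doc.add(f);
    }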

Intuitively, I would expect Lucene to be optimized for searching the values
of fields, and not really the names of fields.  In a somewhat large index,
say 10 million documents, will Lucene search performance continue to be
acceptable if I load up documents with many fields like this?

Is there an upper limit on the number of fields comprising a document, and
if so what is it?

Or, is there some way to make my original approach work after all?

Regards and thanks,
Charlie

Re: documents with large numbers of fields

Posted by Steven Rowe <sa...@syr.edu>.
Mike Klaas wrote:
> On 18-May-07, at 1:01 PM, charlie w wrote:
>> Is there an upper limit on the number of fields comprising a document,
>> and if so what is it?
> 
> There is not.  They are relatively costless if omitNorms=False

Mike, I think you meant "relatively costless if omitNorms=True".

Steve



Re: documents with large numbers of fields

Posted by Mike Klaas <mi...@gmail.com>.
On 18-May-07, at 1:01 PM, charlie w wrote:
> So now I have the idea to invert the field name and value, like this:
> foo=tag      ^2
> bar=tag      ^1.2
> foobar=tag   ^1.8
> and search "foo:tag".
>
> Intuitively, I would expect Lucene to be optimized for searching the
> values of fields, and not really the names of fields.  In a somewhat
> large index, say 10 million documents, will Lucene search performance
> continue to be acceptable if I load up documents with many fields like
> this?

Perhaps not.  Storing a field with norms occupies O(N) space, regardless
of the number of documents with non-zero norms.  There might be too much
data for the OS to cache and Lucene to process efficiently.
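
To put a rough number on it: norms cost one byte per document per field,
so 10 million documents times, say, 500 distinct tag fields is on the
order of 5 GB of norms, even if most documents use only a few of those
fields.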

> Is there an upper limit on the number of fields comprising a document,
> and if so what is it?

There is not.  They are relatively costless if omitNorms=False

> Or, is there some way to make my original approach work after all?

The experimental Payloads feature allows an optional boost to be stored
along with each term position.  This is the intended use case.
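
As a sketch against the current TokenStream API (the filter name and the
float encoding here are just illustrative), you could attach a per-term
boost as a payload with something like:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Payload;

    // Illustrative: stores a float boost as a 4-byte payload on each token.
    public class BoostPayloadFilter extends TokenFilter {
      private final byte[] payloadBytes;

      public BoostPayloadFilter(TokenStream in, float boost) {
        super(in);
        int bits = Float.floatToIntBits(boost);
        payloadBytes = new byte[] {
          (byte) (bits >>> 24), (byte) (bits >>> 16),
          (byte) (bits >>> 8),  (byte) bits
        };
      }

      public Token next() throws IOException {
        Token t = input.next();
        if (t != null) {
          t.setPayload(new Payload(payloadBytes));
        }
        return t;
      }
    }

Note that reading the payload back at search time still takes a custom
query and/or Similarity; there's no stock scoring query for this yet, as
far as I know.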

-Mike
