Posted to java-user@lucene.apache.org by Gregor Heinrich <Gr...@igd.fhg.de> on 2003/07/30 12:16:47 UTC

Multiple fields identical terms.

Hi everyone,

My index has a title field and an abstract field, both inverted and tokenized.

I would like to have unique term texts in my term enumeration. That is,
across all fields there should be no duplicate term text.

An easy solution would be to only use one field.

But does someone know an alternative way with multiple fields?

Best regards,

Gregor


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


RE: Multiple fields identical terms.

Posted by Gregor Heinrich <Gr...@igd.fhg.de>.
Hi.

Thanks for your suggestion; I think the storage overhead is bearable.

Actually I am doing some sort of forward indexing in addition to the
inverted index. I.e., the result will be a meta-search engine that combines
the Lucene IR process proper with an aspect model similar to Latent Semantic
Analysis. To store the forward index, it's necessary to create a
term-document matrix where the terms should all be unique regardless of the
field. This kind of vector space indexing could as well be useful for other
purposes such as document classification.
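
A minimal sketch of such a term-document matrix, assuming that term texts
from all fields are folded into one vocabulary. This is plain Java with no
Lucene dependency; the class TermDocumentMatrix, its methods, and the
simplistic tokenize() stand-in for a real analyzer are hypothetical names
used only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TermDocumentMatrix {
    // Row index of each unique term text, in order of first appearance.
    final Map<String, Integer> termIndex = new LinkedHashMap<>();
    // counts.get(row)[doc] = occurrences of that term in that document, summed over all fields.
    final List<int[]> counts = new ArrayList<>();
    final int numDocs;

    TermDocumentMatrix(int numDocs) {
        this.numDocs = numDocs;
    }

    // Stand-in for a real analyzer (assumption): lower-case, split on anything
    // that is not a letter or a digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Adds the text of one field of document docId. The field name is
    // deliberately not part of the key, so a term text gets exactly one row.
    void addField(int docId, String fieldText) {
        for (String term : tokenize(fieldText)) {
            Integer row = termIndex.get(term);
            if (row == null) {
                row = counts.size();
                termIndex.put(term, row);
                counts.add(new int[numDocs]);
            }
            counts.get(row)[docId]++;
        }
    }

    // Occurrences of a term text in a document; 0 if the term is unknown.
    int count(String term, int docId) {
        Integer row = termIndex.get(term);
        return row == null ? 0 : counts.get(row)[docId];
    }
}
```

Calling addField() once for the title and once for the abstract of the same
document adds to the same row whenever both fields contain the same term
text. A dense int[] per term is only practical for small collections; a
sparse representation would be the more likely choice for a real index.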

One idea is to maintain an additional Hashtable that checks for uniqueness and
attaches additional information to a term, such as its phonetic encoding or
its cataloguing key. But I wanted to reuse as much of the existing
infrastructure as possible and stay compatible.
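
For illustration, the side-table idea could look roughly like the sketch
below: a map keyed by term text alone, so each text appears once no matter
how many fields contain it. TermRegistry, TermInfo, and the free-form
attributes map are hypothetical names, and a HashMap stands in for the
Hashtable mentioned above; nothing here is part of Lucene.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class TermRegistry {
    // Extra information attached to one unique term text.
    static final class TermInfo {
        // Names of the fields in which the term text was seen.
        final Set<String> fields = new TreeSet<>();
        // Free-form slots, e.g. for a phonetic encoding or a cataloguing key.
        final Map<String, String> attributes = new HashMap<>();
    }

    private final Map<String, TermInfo> byText = new HashMap<>();

    // Records that `text` occurs in `field`.
    // Returns true only the first time this term text is seen in any field.
    boolean register(String field, String text) {
        TermInfo info = byText.get(text);
        boolean isNew = (info == null);
        if (isNew) {
            info = new TermInfo();
            byText.put(text, info);
        }
        info.fields.add(field);
        return isNew;
    }

    // Information for a term text, or null if it was never registered.
    TermInfo info(String text) {
        return byText.get(text);
    }

    // Number of unique term texts across all fields.
    int size() {
        return byText.size();
    }
}
```

While walking a term enumeration, one would call register() with each term's
field and text and keep only the terms for which it returns true.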

I also thought of changing the way fields and terms are associated with
each other, i.e., allowing a list of fields in each Term object and thus
making term texts unique. But this would require a substantial redesign of the
index file and access structure...

Gregor



-----Original Message-----
From: Erik Hatcher [mailto:lists@ehatchersolutions.com]
Sent: Wednesday, July 30, 2003 2:40 PM
To: Lucene Users List
Subject: Re: Multiple fields identical terms.


On Wednesday, July 30, 2003, at 06:16  AM, Gregor Heinrich wrote:
> I would like to have unique term texts in my term enumeration. That is,
> across all fields there should be no duplicate term text.
>
> An easy solution would be to only use one field.
>
> But does someone know an alternative way with multiple fields?

What about putting both abstract and title together into a single new
field called "keywords"?  Leave title and abstract there as well, but
just append the two strings together (with a space in the middle to
tokenize properly! :).

Is that a reasonable alternative?  What are you trying to accomplish?

	Erik




Re: Multiple fields identical terms.

Posted by Erik Hatcher <li...@ehatchersolutions.com>.
On Wednesday, July 30, 2003, at 06:16  AM, Gregor Heinrich wrote:
> I would like to have unique term texts in my term enumeration. That is,
> across all fields there should be no duplicate term text.
>
> An easy solution would be to only use one field.
>
> But does someone know an alternative way with multiple fields?

What about putting both abstract and title together into a single new 
field called "keywords"?  Leave title and abstract there as well, but 
just append the two strings together (with a space in the middle to 
tokenize properly! :).
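
A rough sketch of that suggestion in plain Java, with a map standing in for
the document's fields. KeywordsField and buildFields are hypothetical names;
in a real index each entry would be added to the Lucene document as a
tokenized field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class KeywordsField {
    // Builds the field name -> field text pairs for one document.
    static Map<String, String> buildFields(String title, String abstractText) {
        Map<String, String> fields = new LinkedHashMap<>();
        // Keep the original fields so they can still be searched separately.
        fields.put("title", title);
        fields.put("abstract", abstractText);
        // The separating space keeps the last word of the title and the first
        // word of the abstract from fusing into a single token.
        fields.put("keywords", title + " " + abstractText);
        return fields;
    }
}
```

Enumerating terms for the "keywords" field alone then yields each term text
once, at the cost of indexing the text a second time.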

Is that a reasonable alternative?  What are you trying to accomplish?

	Erik

