Posted to solr-user@lucene.apache.org by Artem Lokotosh <ar...@gmail.com> on 2011/12/23 18:26:45 UTC

Storing only unique terms in index

Hi, all

I have a catchall "text" field and use it for searching. This field
stores non-unique terms. For example, it stores the following terms:
test test search
Is it possible to store non-unique terms in the following way:
"term"|"number of terms", i.e. test|2 search?
I guess it should reduce the size of the index.

And if yes - is it possible to use this number of terms when
calculating the relevance?

-- 
Best regards,
Artem Lokotosh        mailto:arconen@gmail.com

Re: Storing only unique terms in index

Posted by Chris Hostetter <ho...@fucit.org>.
: I have a catchall "text" field and use it for searching. This field
: stores non-unique terms. For example, it stores the following terms:
: test test search
: Is it possible to store non-unique terms in the following way:
: "term"|"number of terms", i.e. test|2 search?
: I guess it should reduce the size of the index.
: 
: And if yes - is it possible to use this number of terms when
: calculating the relevance?

what you are describing is exactly how an inverted index like Lucene/Solr
works -- the original raw text can optionally be "stored" for retrieval,
but the index that is *searched* contains each term a single time, along
with pointers to the documents that contain it and the positions where it
occurs in each.  the number of times a term occurs in a document is the
term frequency (or "tf"), and it is one of the two primary components used
in the basic scoring formula (TF/IDF).

https://lucene.apache.org/java/3_5_0/fileformats.html
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
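
To make the idea above concrete, here is a minimal sketch of an inverted index that keeps each term once and records a per-document term frequency. It is a toy illustration, not Lucene's on-disk format: the names build_index and tf_idf are hypothetical, tokenization is a plain whitespace split, term positions are left out, and the weighting is the textbook tf * log(N/df) variant rather than Lucene's actual similarity.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term frequency}.

    Each term appears once as a key; repeats within a document
    are folded into the frequency count.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenization (assumption): lowercase + whitespace split.
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def tf_idf(index, term, doc_id, n_docs):
    """Textbook tf * log(N / df) weight; not Lucene's exact formula."""
    postings = index.get(term, {})
    tf = postings.get(doc_id, 0)
    if tf == 0:
        return 0.0
    return tf * math.log(n_docs / len(postings))


docs = {1: "test test search", 2: "search engine"}
index = build_index(docs)

print(dict(index))
# {'test': {1: 2}, 'search': {1: 1, 2: 1}, 'engine': {2: 1}}
print(tf_idf(index, "test", 1, len(docs)))
```

In this sketch the document "test test search" contributes a single entry for "test" with a frequency of 2, which is the test|2 layout asked about in the question, and that frequency feeds directly into the score.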



-Hoss