Posted to java-user@lucene.apache.org by Li Li <fa...@gmail.com> on 2010/05/31 10:48:27 UTC

Question about Field.setOmitTermFreqAndPositions(true)

I read in "Lucene in Action" that, to save space, we can omit term
frequency and position information. But as far as I know, Lucene's
default scoring model is the vector space model (VSM), which needs
tf(term, doc) to calculate the score. If no tf is saved, will the
relevance score still be correct?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Question about Field.setOmitTermFreqAndPositions(true)

Posted by Li Li <fa...@gmail.com>.

What about TermVector? "Lucene in Action" says:

Term vectors are something of a mix between an indexed field and a
stored field. They are similar to a stored field because you can
quickly retrieve all term vector fields for a given document: term
vectors are keyed first by document ID. But then, they are keyed
secondarily by term, meaning they store a miniature inverted index for
that one document. Unlike a stored field, where the original String
content is stored verbatim, term vectors store the actual separate
terms that were produced by the Analyzer. This allows you to retrieve
all terms, and the frequency of their occurrence within the document,
sorted in lexicographic order, for a particular indexed Field of a
particular Document.

  TermVector.YES – record the unique terms that occurred, and their
      counts, in each document, but do not store any positions or
      offsets information.
  TermVector.WITH_POSITIONS – record the unique terms and their
      counts, and also the positions of each occurrence of every term,
      but no offsets.
  TermVector.WITH_OFFSETS – record the unique terms and their counts,
      with the offsets (start & end character position) of each
      occurrence of every term, but no positions.
  TermVector.WITH_POSITIONS_OFFSETS – store unique terms and their
      counts, along with positions and offsets.
  TermVector.NO – do not store term vectors.

I am confused. What is the difference between TermVector and Index?
We can save position information in the index, and we can also save it
in a TermVector.
If I want to support phrase queries, I must save positions in the
index. And if I want to support the fast highlighter and similar
features, I have to save TermVectors.
How is this information stored?
E.g., there are 2 docs, analyzed with WhitespaceAnalyzer:
1,  it is a good day    good night
2,  you are a good man

The index's data structure seems to look like:   good -> doc1 2(tf)  3  5; doc2 1(tf) 3
What does the term vector look like? "Lucene in Action" says it is
keyed first by doc ID, then by term. I can't picture it.

2010/5/31 Andrzej Bialecki <ab...@getopt.org>:
> On 2010-05-31 10:54, Uwe Schindler wrote:
>> No.
>
> See also LUCENE-2048 (nice round number ;) ).

Re: Question about Field.setOmitTermFreqAndPositions(true)

Posted by Michael McCandless <lu...@mikemccandless.com>.

TermVectors are not used for searching; they just store each doc, inverted.

They allow you to retrieve all terms (and optionally their
positions/offsets) for a given document.  But this entails a seek,
per-document, so it's fairly costly.
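
To make "each doc, inverted" concrete, here is a rough sketch in plain Java (no Lucene classes) built on the two example docs from earlier in the thread. The class and method names are made up for illustration, and the maps only model the logical layout, not Lucene's actual file format. Positions are counted from 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical names; the maps model the logical layout only, not Lucene's file format.
class InvertedSketch {

    // Whitespace tokenization, standing in for WhitespaceAnalyzer.
    static String[] tokens(String text) {
        return text.trim().split("\\s+");
    }

    // Postings (what searching uses): term -> docId -> positions in that doc.
    static Map<String, Map<Integer, List<Integer>>> postings(Map<Integer, String> docs) {
        Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            String[] toks = tokens(doc.getValue());
            for (int pos = 0; pos < toks.length; pos++) {
                index.computeIfAbsent(toks[pos], t -> new TreeMap<>())
                     .computeIfAbsent(doc.getKey(), d -> new ArrayList<>())
                     .add(pos);
            }
        }
        return index;
    }

    // Term vectors: docId -> term -> positions, the same data keyed by doc first.
    static Map<Integer, Map<String, List<Integer>>> termVectors(Map<Integer, String> docs) {
        Map<Integer, Map<String, List<Integer>>> vectors = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            String[] toks = tokens(doc.getValue());
            for (int pos = 0; pos < toks.length; pos++) {
                vectors.computeIfAbsent(doc.getKey(), d -> new TreeMap<>())
                       .computeIfAbsent(toks[pos], t -> new ArrayList<>())
                       .add(pos);
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new TreeMap<>();
        docs.put(1, "it is a good day    good night");
        docs.put(2, "you are a good man");
        // prints {1=[3, 5], 2=[3]}
        System.out.println(postings(docs).get("good"));
        // prints {a=[2], day=[4], good=[3, 5], is=[1], it=[0], night=[6]}
        System.out.println(termVectors(docs).get(1));
    }
}
```

The postings answer "which docs contain this term?", which is what a query needs; the term vector answers "which terms does this doc contain?", which is why it is looked up per document rather than used for searching.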

Highlighters use term vectors because they are a good way to map a
given term back to the start/end offset in the original text; without
them you usually have to re-analyze the text (though you could also
do highlighting client-side, e.g. use JS to locate all surface forms
for a given term and highlight them, for HTML; or ask Acrobat Reader
to similarly highlight terms).
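
A small sketch of the offset idea, again in plain Java with made-up names rather than Lucene's highlighter classes: the text is analyzed once, the start/end character offsets of each term are kept (roughly what TermVector.WITH_OFFSETS would record), and highlighting later needs only those offsets plus the original text. The `<b>` markup is just an assumption for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names; not Lucene's highlighter API.
class OffsetHighlightSketch {

    // One occurrence of a term with its [start, end) character offsets in the text.
    static final class Occurrence {
        final String term;
        final int start;
        final int end;

        Occurrence(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Whitespace tokenization that records offsets, done once at "index time".
    static List<Occurrence> analyze(String text) {
        List<Occurrence> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) out.add(new Occurrence(text.substring(start, i), start, i));
        }
        return out;
    }

    // Wraps every occurrence of the term in <b>..</b> using only the stored offsets.
    static String highlight(String text, List<Occurrence> stored, String term) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (Occurrence o : stored) {
            if (!o.term.equals(term)) continue;
            sb.append(text, last, o.start)
              .append("<b>").append(text, o.start, o.end).append("</b>");
            last = o.end;
        }
        return sb.append(text, last, text.length()).toString();
    }

    public static void main(String[] args) {
        String text = "it is a good day    good night";
        // prints it is a <b>good</b> day    <b>good</b> night
        System.out.println(highlight(text, analyze(text), "good"));
    }
}
```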

Mike

On Mon, May 31, 2010 at 5:40 AM, Andrzej Bialecki <ab...@getopt.org> wrote:
> On 2010-05-31 10:54, Uwe Schindler wrote:
>> No.
>
> See also LUCENE-2048 (nice round number ;) ).

Re: Question about Field.setOmitTermFreqAndPositions(true)

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 2010-05-31 10:54, Uwe Schindler wrote:
> No.

See also LUCENE-2048 (nice round number ;) ).


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: Question about Field.setOmitTermFreqAndPositions(true)

Posted by Uwe Schindler <uw...@thetaphi.de>.

No.
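
A minimal sketch of why the scores change, under two assumptions: the tf factor is the square root of the within-document frequency (as in the classic default similarity), and a field indexed without frequencies is scored as if every matching term occurred exactly once. The class and method names are hypothetical, and the other score factors (idf, norms, boosts) are left out.

```java
// Hypothetical names; only the tf factor of the score is modeled here.
class TfSketch {

    // Assumed tf factor: square root of the within-document frequency.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Frequency visible to the scorer: with frequencies omitted, the index
    // only knows whether the term occurs, so any match counts as 1.
    static int storedFreq(int actualFreq, boolean omitTf) {
        if (actualFreq == 0) return 0;
        return omitTf ? 1 : actualFreq;
    }

    public static void main(String[] args) {
        // "good" occurs twice in doc 1 and once in doc 2 of the example docs.
        System.out.println("with tf:    doc1=" + tf(storedFreq(2, false))
                + " doc2=" + tf(storedFreq(1, false)));
        System.out.println("omitted tf: doc1=" + tf(storedFreq(2, true))
                + " doc2=" + tf(storedFreq(1, true)));
    }
}
```

With frequencies stored, doc 1 gets a higher tf factor than doc 2 for "good"; with frequencies omitted, both get the same factor, so the ranking can differ.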

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Li Li [mailto:fancyerii@gmail.com]
> Sent: Monday, May 31, 2010 10:48 AM
> To: java-user@lucene.apache.org
> Subject: Question about Field.setOmitTermFreqAndPositions(true)
> 
> I read in "Lucene in Action" that to save space, we can omit termfreq and
> position information. But as far as I know, Lucene's default scoring model
> is VSM, which needs tf(term,doc) to calculate the score. If there is no tf
> saved, will the relevance score be correct?