Posted to java-user@lucene.apache.org by sol myr <so...@yahoo.com> on 2011/10/04 11:46:12 UTC

[Lucene] Frequencies and positions - are they stored per field?


Hi,

I use Lucene, but am not familiar with its internals.
I'd appreciate help understanding whether Term Frequencies and Positions are stored per Document or per Field?
On the one hand, I never ask for "Field.TermVector" because I read it's only required for "MoreLikeThis" (which I don't need).
On the other hand, my searches *are* based on fields...

Here's my code:
// Write (without Field.TermVector):
Document doc = new Document();
doc.add(new Field("subject", "Requisition request", Store.YES, Index.ANALYZED));
doc.add(new Field("body", "Attached is an Urgent requisition request", Store.YES, Index.ANALYZED));
write.addDocument(doc);

// And my Query:
Query query = parser.parse("subject : urgent");

Now how does Lucene manage this query?
I asked it to search the "subject" Field.
But if the "inverted index" doesn't keep fields, it would only remember that "The term 'Urgent' appears in SOME FIELD of document #1"...
Isn't that true?

If so, how would it make sure to retrieve only documents that match in the Subject?

Thanks.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: [Lucene] Frequencies and positions - are they stored per field?

Posted by sol myr <so...@yahoo.com>.
Thanks so much, this helped a lot :)



RE: [Lucene] Frequencies and positions - are they stored per field?

Posted by Uwe Schindler <uw...@thetaphi.de> on Tuesday, October 4, 2011 12:14 PM.
Hi,

Term vectors are, in a sense, duplicate information. They let you quickly retrieve, *per document*, all vectors for *one field*. This means you get the positions, offsets, and frequencies for the requested document as one blob, like a stored field, which can be used e.g. for MoreLikeThis or highlighting (FastVectorHighlighter also needs term vectors).

It's the same as the difference between indexed fields and stored fields (in fact, the information stored when you enable term vectors during indexing is similar to a stored field; think of it as a binary stored field containing all vectors for the corresponding document).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
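
To make the reply above concrete, here is a minimal sketch in plain Java (no Lucene dependency) of why term vectors duplicate what the inverted index already holds, and why they can still be worth storing. It is a toy model under stated assumptions, not Lucene's actual on-disk format: the class TermVectorSketch and its methods are hypothetical names, tokens are split on whitespace only, and only one field is modelled.

```java
import java.util.*;

/** Toy contrast between postings (term -> docs) and term vectors (doc -> terms) for ONE field. */
class TermVectorSketch {
    // Inverted index for the field: term -> (docId -> positions of the term in that doc)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();
    // Term vectors for the same field: docId -> (term -> positions). Same data, keyed by document.
    private final Map<Integer, Map<String, List<Integer>>> termVectors = new HashMap<>();

    /** Lowercases and splits on whitespace; a stand-in for a real analyzer. */
    public void add(int docId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>()).add(pos);
            termVectors.computeIfAbsent(docId, d -> new TreeMap<>())
                       .computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
    }

    /** With term vectors: a single lookup by document id. */
    public Map<String, List<Integer>> vectorFromTermVectors(int docId) {
        Map<String, List<Integer>> vector = termVectors.get(docId);
        return vector == null ? new TreeMap<String, List<Integer>>() : vector;
    }

    /** Without term vectors: every term of the field must be visited to rebuild the same answer. */
    public Map<String, List<Integer>> vectorFromPostings(int docId) {
        Map<String, List<Integer>> result = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, List<Integer>>> e : postings.entrySet()) {
            List<Integer> positions = e.getValue().get(docId);
            if (positions != null) {
                result.put(e.getKey(), positions);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TermVectorSketch body = new TermVectorSketch();
        body.add(1, "Attached is an Urgent requisition request");
        body.add(2, "Urgent urgent request");
        System.out.println(body.vectorFromTermVectors(2)); // {request=[2], urgent=[0, 1]}
        System.out.println(body.vectorFromPostings(2));    // same content, found by scanning all terms
    }
}
```

In this model the frequency of a term in a document is simply the length of its position list. Both methods return the same map; the difference is that the postings-based one has to walk the whole term dictionary of the field, which is the "inefficient to deduce" case raised in the question.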


Re: [Lucene] Frequencies and positions - are they stored per field?

Posted by sol myr <so...@yahoo.com> on Tuesday, October 04, 2011 12:08 PM.
Thanks a lot.
But then what's the added value of Field.TermVector?

Can't it be deduced from the overall Lucene index? Or is it just inefficient to deduce?

Thanks again :)



RE: [Lucene] Frequencies and positions - are they stored per field?

Posted by Uwe Schindler <uw...@thetaphi.de> on Tuesday, October 4, 2011 11:53 AM.
Lucene always uses a field; a query using a term without a field is
impossible. Think of each field as a parallel inverted index; all statistics
are per field, too. If you pass a query without a field name to QueryParser,
it will choose the default field, which is given when creating the QueryParser.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
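
The "one inverted index per field" picture from the reply above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not how Lucene is implemented: PerFieldIndexSketch, addField and search are hypothetical names, the tokenizer is a crude stand-in for an analyzer, and the query syntax handled is just "field:term" or a bare term.

```java
import java.util.*;

/** Toy model of a per-field inverted index; NOT Lucene's real data structures. */
class PerFieldIndexSketch {
    // field name -> (term -> ids of documents containing the term in THAT field)
    private final Map<String, Map<String, SortedSet<Integer>>> postings = new HashMap<>();
    private final String defaultField;

    public PerFieldIndexSketch(String defaultField) {
        this.defaultField = defaultField;
    }

    /** Lowercases and splits on non-letter/non-digit characters, then records each term under its field. */
    public void addField(int docId, String field, String text) {
        Map<String, SortedSet<Integer>> terms = postings.computeIfAbsent(field, f -> new HashMap<>());
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Accepts "field:term" or a bare "term"; a bare term is looked up in the default field. */
    public SortedSet<Integer> search(String query) {
        int colon = query.indexOf(':');
        String field = colon < 0 ? defaultField : query.substring(0, colon).trim();
        String term = query.substring(colon + 1).trim().toLowerCase(Locale.ROOT);
        Map<String, SortedSet<Integer>> terms = postings.get(field);
        if (terms == null) {
            return new TreeSet<Integer>();
        }
        SortedSet<Integer> docs = terms.get(term);
        if (docs == null) {
            return new TreeSet<Integer>();
        }
        return docs;
    }

    public static void main(String[] args) {
        PerFieldIndexSketch index = new PerFieldIndexSketch("body");
        index.addField(1, "subject", "Requisition request");
        index.addField(1, "body", "Attached is an Urgent requisition request");
        System.out.println(index.search("subject : urgent")); // []  ("urgent" is not in the subject)
        System.out.println(index.search("body:urgent"));      // [1]
        System.out.println(index.search("urgent"));           // [1] (falls back to default field "body")
    }
}
```

With the document from the original question, the "subject" lookup finds nothing because the term only occurs in "body"; the field name is part of the lookup key, so there is never an "appears in SOME FIELD" entry to disambiguate.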

