Posted to java-user@lucene.apache.org by Donna L Gresh <gr...@us.ibm.com> on 2007/03/20 19:30:44 UTC

Obtaining the (indexed) terms in a field in a particular document

My apologies if this is a simple question.

How can I get all the terms (stemmed, with stop words removed, etc.) in a
particular field of a particular document?

Suppose each of my documents consists of two fields: one named "my_id" that
holds a unique identifier, and another that holds a text string of several
words. Given the unique identifier, I'd like to get all the terms in the
text string.

(My underlying goal is a kind of document similarity between that text
string and some other text string, using a boolean query with a number of
SHOULD clauses, if that makes sense. I welcome suggestions of better ways
to do this.)

Donna L. Gresh

Re: Obtaining the (indexed) terms in a field in a particular document

Posted by Erick Erickson <er...@gmail.com>.

Well, depending upon your storage requirements, it's actually much easier
than using term vectors. Assuming you're adding this field (or a duplicate
of it) as UN_TOKENIZED (in which case there's no need to store it), you can
just spin through all the terms for that field with TermDocs/TermEnum. The
trick is to start from a term whose value is "", i.e. new Term(field, ""),
to enumerate them all. See TermDocs.seek.

This avoids TermVectors entirely, which will save you some space.

Erick
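
A note on why the empty string works: the term dictionary is ordered by
field name first and term text second, so seeking to (field, "") lands on
the first term of that field, and you can read forward until the field
changes. The sketch below models that ordering with a plain
java.util.TreeSet. It is not Lucene code: the Term class, termsInField, and
sampleDictionary are hypothetical stand-ins for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Stdlib-only model of a sorted term dictionary; not Lucene code. */
final class TermSeekSketch {

    /** Hypothetical stand-in for a (field, text) term. */
    static final class Term implements Comparable<Term> {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(Term other) {
            // Order by field first, then by text within the field.
            int byField = field.compareTo(other.field);
            return byField != 0 ? byField : text.compareTo(other.text);
        }
    }

    /** Returns the texts of all terms in the given field, in sorted order. */
    static List<String> termsInField(TreeSet<Term> dictionary, String field) {
        List<String> result = new ArrayList<>();
        // Seek to (field, ""): the empty string sorts before any other text,
        // so the tail view starts at the first term of the field, if any.
        for (Term t : dictionary.tailSet(new Term(field, ""), true)) {
            if (!t.field.equals(field)) {
                break; // walked past the last term of this field
            }
            result.add(t.text);
        }
        return result;
    }

    /** A tiny made-up dictionary spanning three fields. */
    static TreeSet<Term> sampleDictionary() {
        TreeSet<Term> dictionary = new TreeSet<>();
        dictionary.add(new Term("my_id", "doc1"));
        dictionary.add(new Term("text", "lucene"));
        dictionary.add(new Term("text", "index"));
        dictionary.add(new Term("text", "term"));
        dictionary.add(new Term("title", "zebra"));
        return dictionary;
    }

    public static void main(String[] args) {
        // prints [index, lucene, term]
        System.out.println(termsInField(sampleDictionary(), "text"));
    }
}
```

The loop has to stop as soon as the field changes. Otherwise it runs on into
the terms of whatever field sorts next.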


Re: Obtaining the (indexed) terms in a field in a particular document

Posted by Donna L Gresh <gr...@us.ibm.com>.

Thanks, I see what you are saying.

Seems that if I create the field at index time with term vectors stored, 
then I can iterate through the documents and get both the unique 
identifier and the terms, right? My original question was imprecise in 
that I'm going to want to get all the terms for *all* the documents (one 
document at a time) so I can just iterate through all the documents using

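                for (int i = 0; i < indexReaderR.maxDoc(); i++) {
                        // Document numbers run up to maxDoc(); deleted documents leave gaps.
                        if (indexReaderR.isDeleted(i)) continue;
                        TermFreqVector tfv = indexReaderR.getTermFreqVector(i, "my text field name");
                        // tfv.getTerms() then holds the indexed terms for document i
                }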

Donna L. Gresh
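
A note on the loop above: it walks document numbers, so it helps to be
clear about how those behave once documents have been deleted. IndexReader
distinguishes numDocs() (the count of live documents) from maxDoc() (one
greater than the largest document number), and isDeleted(n) reports whether
a number is a gap. The sketch below is a stdlib-only model of that
numbering, not Lucene code. The DocNumberSketch class and its sample data
are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stdlib-only model of document numbering with deletions; not Lucene code. */
final class DocNumberSketch {

    /** Slot i holds the terms of document number i, or null if it was deleted. */
    private final List<String[]> slots;

    DocNumberSketch(List<String[]> slots) {
        this.slots = slots;
    }

    /** One greater than the largest document number in use. */
    int maxDoc() {
        return slots.size();
    }

    /** Number of live (non-deleted) documents. */
    int numDocs() {
        int live = 0;
        for (String[] terms : slots) {
            if (terms != null) {
                live++;
            }
        }
        return live;
    }

    boolean isDeleted(int docNumber) {
        return slots.get(docNumber) == null;
    }

    /** Walks every document number and keeps the ones that are still live. */
    List<Integer> liveDocNumbers() {
        List<Integer> live = new ArrayList<>();
        for (int i = 0; i < maxDoc(); i++) {
            if (!isDeleted(i)) {
                live.add(i);
            }
        }
        return live;
    }

    /** Three documents were added, then document 1 was deleted. */
    static DocNumberSketch sample() {
        return new DocNumberSketch(Arrays.asList(
                new String[] {"lucene", "index"},
                null,
                new String[] {"term", "vector"}));
    }

    public static void main(String[] args) {
        DocNumberSketch index = sample();
        // prints 2 live of 3
        System.out.println(index.numDocs() + " live of " + index.maxDoc());
        // prints [0, 2]
        System.out.println(index.liveDocNumbers());
    }
}
```

In this model a loop bounded by numDocs() would visit numbers 0 and 1 and
never reach document 2, which is why the bound should be maxDoc() with a
deletion check.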

Re: Obtaining the (indexed) terms in a field in a particular document

Posted by Erick Erickson <er...@gmail.com>.

Sorry, but you have to have the Lucene document ID, which you can get
from a Hits or HitCollector, among other options, or by using
TermDocs/TermEnum on your unique id (my_id in your example).

Erick
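
To illustrate the lookup described above: an inverted index maps each
(field, text) term to the numbers of the documents that contain it, so a
unique identifier resolves to a single document number. The sketch below is
a stdlib-only model of that idea, not Lucene code. IdLookupSketch, add, and
docNumberFor are hypothetical names, and the ids are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stdlib-only model of an inverted index; not Lucene code. */
final class IdLookupSketch {

    /** Maps "field:text" to the numbers of the documents containing that term. */
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Records that the document with this number contains the term. */
    void add(String field, String text, int docNumber) {
        postings.computeIfAbsent(field + ":" + text, k -> new ArrayList<>())
                .add(docNumber);
    }

    /** Returns the document number for a unique id, or -1 if it is absent. */
    int docNumberFor(String idField, String id) {
        List<Integer> docs = postings.get(idField + ":" + id);
        // A unique identifier should match at most one document,
        // so the first posting is the answer.
        return (docs == null || docs.isEmpty()) ? -1 : docs.get(0);
    }

    public static void main(String[] args) {
        IdLookupSketch index = new IdLookupSketch();
        index.add("my_id", "A-17", 0);
        index.add("my_id", "B-42", 1);
        System.out.println(index.docNumberFor("my_id", "B-42")); // prints 1
    }
}
```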


Re: Obtaining the (indexed) terms in a field in a particular document

Posted by Erick Erickson <er...@gmail.com>.

You can do a document.get(field), *assuming* you have stored the data
(Field.Store.YES) at index time, although you may not get
stop words.
