
Posted to java-user@lucene.apache.org by tierecke <ni...@gmail.com> on 2007/08/03 11:18:24 UTC

Get the TokenStream of an indexed but unstored field

Hi,

I indexed a large number of large documents, but I did not index the
documents themselves.
Now I am interested in getting the vector (i.e., the terms indexed and their
frequencies) of that indexed but unstored field.
doc.getField(fieldname) returns null.
How can I get the data? It must be there, since it's a part of the index, or
am I wrong?

Would be grateful for a quick result (need to submit data for a conference
this weekend).
thanks,
Nir.
-- 
View this message in context: http://www.nabble.com/Get-the-TokenStream-of-an-indexed-but-unstored-field-tf4211430.html#a11980001
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

Re: Get the terms and frequency vector of an indexed but unstored field

Posted by Karl Wettin <ka...@gmail.com>.

On 6 Nov 2007, at 09:51, Shailendra Mudgal wrote:

> Hi,
> If we did not set this flag while indexing, is there any other way to
> get this info, I mean the TermFreqVector for a document?

See TermVectorAccessor in JIRA.

http://issues.apache.org/jira/browse/LUCENE-1016

The highlighter also has some ad hoc code for extracting the data from
the inverted index using TermEnum and TermDocs. It can, however, take
quite some time.

-- 
karl
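
To illustrate the slower route mentioned above, here is a minimal,
self-contained sketch of rebuilding one document's term-frequency vector by
walking every term of an inverted index and checking its postings. It is a toy
model in plain Java, not Lucene code: TermVectorSketch, Posting, buildIndex and
termVector are hypothetical names, with the outer loop standing in for a
TermEnum and the inner loop for a TermDocs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index for illustration only; this is NOT the Lucene API. */
final class TermVectorSketch {

    /** One posting: a document number and the term's frequency in that document. */
    static final class Posting {
        final int doc;
        final int freq;
        Posting(int doc, int freq) { this.doc = doc; this.freq = freq; }
    }

    /** Builds term -> postings from tokenized documents (doc number = list position). */
    static TreeMap<String, List<Posting>> buildIndex(List<String[]> docs) {
        TreeMap<String, List<Posting>> index = new TreeMap<>();
        for (int doc = 0; doc < docs.size(); doc++) {
            // Count how often each term occurs in this document.
            Map<String, Integer> counts = new TreeMap<>();
            for (String token : docs.get(doc)) {
                counts.merge(token, 1, Integer::sum);
            }
            // Append one posting per distinct term.
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                index.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                     .add(new Posting(doc, e.getValue()));
            }
        }
        return index;
    }

    /** Rebuilds one document's term vector by scanning every term's postings. */
    static TreeMap<String, Integer> termVector(Map<String, List<Posting>> index, int targetDoc) {
        TreeMap<String, Integer> vector = new TreeMap<>();
        for (Map.Entry<String, List<Posting>> e : index.entrySet()) { // every term in the index
            for (Posting p : e.getValue()) {                           // every document for that term
                if (p.doc == targetDoc) {
                    vector.put(e.getKey(), p.freq);
                    break; // a term has at most one posting per document
                }
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String[]> docs = new ArrayList<>();
        docs.add(new String[] {"lucene", "index", "lucene"});
        docs.add(new String[] {"index", "field"});
        TreeMap<String, List<Posting>> index = buildIndex(docs);
        System.out.println(termVector(index, 0)); // prints {index=1, lucene=2}
    }
}
```

The scan touches every term in the whole index to answer a question about a
single document, which is why this route can be slow on a large index; a term
vector stored at index time avoids the scan.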


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Get the terms and frequency vector of an indexed but unstored field

Posted by Shailendra Mudgal <mu...@gmail.com>.

Hi,
If we did not set this flag while indexing, is there any other way to
get this info, I mean the TermFreqVector for a document?



On 8/3/07, testn <te...@doramail.com> wrote:
> You can use IndexReader.getTermFreqVectors(int n) to get all terms and
> their frequencies. Make sure that when you create the index, you choose to
> store them by specifying a Field.TermVector option.
> Check out http://www.cnlp.org/presentations/slides/AdvancedLuceneEU.pdf

Re: Get the terms and frequency vector of an indexed but unstored field

Posted by testn <te...@doramail.com>.

You can use IndexReader.getTermFreqVectors(int n) to get all terms and their
frequencies. Make sure that when you create the index, you choose to store
them by specifying a Field.TermVector option.
Check out http://www.cnlp.org/presentations/slides/AdvancedLuceneEU.pdf



tierecke wrote:
> 
> Hi,
> 
> I indexed a large number of large documents, but I did not store the
> documents themselves, just indexed them.
> Now I am interested in getting the vector (i.e., the terms indexed and their
> frequencies) of that indexed but unstored field.
> doc.getField(fieldname) returns null.
> How can I get the data? It must be there, since it's a part of the index,
> or am I wrong?
> 
> Would be grateful for a quick result (need to submit data for a conference
> this weekend).
> thanks,
> Nir.
> 


Re: Get the terms and frequency vector of an indexed but unstored field

Posted by tierecke <ni...@gmail.com>.

Thanks a lot, that works 100%!
Fortunately, I did use the flag telling Lucene to store the term
frequency vector. Otherwise, I'd have to re-index 77 GB right now... :-)

Re: Get the TokenStream of an indexed but unstored field

Posted by tierecke <ni...@gmail.com>.

I fixed my question later. I meant that I did not STORE the documents themselves.
Anyway, the issue is already solved, thanks to testn.
But there are new (for me) hard questions.
Thanks a lot!

Erick Erickson wrote:
> 
> <<<I indexed a large number of large documents, but I did not index the
> documents themselves.>>>
> 
> This is really confusing since it's self-contradictory. Could you
> post the lines where you do the document.add() for the fields in
> question?
> 
> Best
> Erick
> 

Re: Get the TokenStream of an indexed but unstored field

Posted by Erick Erickson <er...@gmail.com>.

<<<I indexed a large number of large documents, but I did not index the
documents themselves.>>>

This is really confusing since it's self-contradictory. Could you
post the lines where you do the document.add() for the fields in
question?

Best
Erick
