Posted to java-user@lucene.apache.org by Grant Ingersoll <gs...@syr.edu> on 2004/07/28 23:35:28 UTC
Re: TermFreqVector Beginner Question
Can you post the whole section of related code? Sounds like you are doing things right.
The Lucene source code includes a test file called TestTermVectors.java; take a look at it and see how your code compares. I ran that test against HEAD and it passed.
>>> matt@thebasement.com 07/28/04 04:51PM >>>
Howdy,
I am new to Lucene and thus far I am very impressed. Thanks to all who have
worked on this project!
I am working on a project where I want to do the following:
1.) Index a bunch of documents.
2.) Pluck out one of the documents by Lucene document number
3.) Get a term frequency for that document
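For readers unsure what step 3 asks for, here is a minimal plain-Java sketch of a per-document term frequency table: each distinct term paired with the number of times it occurs. It does not use Lucene at all. The class name TermCounts, the lower-casing, and the split on non-alphanumeric characters are assumptions made for illustration and will not match what a Lucene analyzer produces.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of Lucene.
class TermCounts {

    // Lower-cases the text, splits it on runs of characters that are
    // neither letters nor digits, and counts how often each term occurs.
    // The TreeMap keeps the terms in sorted order.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints each distinct term with its count.
        System.out.println(count("the quick fox and the lazy dog"));
    }
}
```

A stored term vector saves you from recomputing a table like this from the raw text each time you need it.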
After some digging and playing I came across this method...
IndexReader.getTermFreqVector(int docNumber, String field)
This is exactly what I want. So I ran the IndexFiles demo program with some
test documents and started poking at the index with an IndexReader. But when I
called
IndexReader.getTermFreqVector(someDocNumber,"contents")
I get null back. After a little more digging I found that for a TermVector to
exist the Field has to have the TermVector flag set. So I changed some lines
in the demo FileDocument.Document method to:
FileInputStream is = new FileInputStream(f);
Reader reader = new BufferedReader(new InputStreamReader(is));
doc.add(Field.Text("contents", reader.toString(),true));
with the "true" parameter causing the new Field to turn on the storeTermVector
flag, right? So then I reindex and get the same results - getTermFreqVector
returns NULL. So I inspect the field list of the Document from the index:
Document d = ir.document(td.doc());
System.out.println(" Path: "+d.get("path"));
for (Enumeration e = d.fields() ; e.hasMoreElements() ;)
{
    System.out.println(((Field)e.nextElement()).toString());
}
and I discover that there is now NO "contents" Field. If I change the parameter
in Field.Text to false, I get a "contents" Field but no TermVector. To date I
haven't been able to figure out how to get a TermFreqVector at all.
What am I missing?
I have looked at the documentation - all the tutorials I have found just cover
the basics.
I have read the news group postings related to "TermVectors" and
"TermFreqVectors" and everybody says stuff like "the new 1.4 Vector stuff is
great". So how do they know? Where can I learn about this? Are there any more
complete user tutorials/references that cover TermVector features?
Oh, I am using the 1.4 Lucene release in case it matters.
Thanks in advance,
Matt Galloway
Tulsa, Oklahoma
(BTW, I also tried Field.UnStored with the same results.)
-------------------------------------------------
This mail sent through IMP: http://horde.org/imp/
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: TermFreqVector Beginner Question
Posted by Daniel Naber <da...@t-online.de>.
On Thursday 29 July 2004 17:31, Matt Galloway wrote:
> Field.Text(String name, Reader value, boolean storeTermVector)
> Field.UnStored(String name, String value, boolean storeTermVector)
>
> DO NOT store the contents of the field
This part of the API is known to be difficult and will be fixed for Lucene 2.0
(which is the next version). Till then, I'll try to remember to extend the
documentation.
Regards
Daniel
Re: TermFreqVector Beginner Question
Posted by Matt Galloway <ma...@thebasement.com>.
Well, as one would expect, most of the problems were mine. Here is what I
learned... (please comment on the accuracy of these statements).
1.) Setting storeTermVector to true does nothing if store is false, i.e.
you must store the contents of a field in order to retrieve TermVectors
for it later. This may seem obvious to everyone else, but to a new user
this is anything but obvious as it is not documented anywhere that I have
seen. I think it would be very helpful to include this tidbit in the
Field class JavaDoc if not in the FAQ or some other place.
I also think it would be helpful to prevent the user from combinations
of store and storeTermVector that don't make sense, namely store = false
and storeTermVector = true. Maybe an exception or something.
2.) The following methods...
Field.Text(String name, Reader value, boolean storeTermVector)
Field.UnStored(String name, String value, boolean storeTermVector)
DO NOT store the contents of the field and (based on my assumption in
point 1 and through observation) consequently DO NOT store TermVectors,
regardless of the value of their storeTermVector argument. If this is
accurate, why do these methods exist? This is very misleading to the new user.
3.) I am also new to Java, so if you look at my earlier sample code you
will see that I used "reader.toString()" where reader is a buffered file
reader. This of course does not have the desired effect. I have since
rewritten the code to pass a string that contains the content of the file
instead of some object address thing. This doesn't affect Lucene or
term vectors, just my ego.
Once you understand that store=true is ALSO a prerequisite for TermVectors (in
addition to storeTermVector=true) then everything works great.
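On point 3, here is a minimal sketch of the difference between calling toString() on a reader and actually reading its contents. It uses only the standard Java library. The class name ReaderContents and the helper readAll are hypothetical names chosen for this example, and a StringReader stands in for the file so the snippet runs without any input file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only; not part of Lucene.
class ReaderContents {

    // Drains the reader and returns everything it produced as one String.
    // IOException is wrapped so callers need no checked-exception handling.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // BufferedReader does not override toString(), so this prints the
        // class name plus a hash code, not the text behind the reader.
        Reader wrong = new BufferedReader(new StringReader("some file text"));
        System.out.println(wrong.toString());

        // Reading the characters out gives the actual content.
        Reader right = new BufferedReader(new StringReader("some file text"));
        System.out.println(readAll(right));
    }
}
```

The string returned by readAll is what should be handed to the field, in place of reader.toString().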
Thanks for the help,
Matt Galloway