You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Grant Ingersoll <gs...@syr.edu> on 2004/07/28 23:35:28 UTC

Re: TermFreqVector Beginner Question

Can you post the whole section of related code?  Sounds like you are doing things right.  

In the Lucene source code, there is a file called TestTermVectors.java, take a look at that and see how your stuff compares.  I ran the test against the HEAD and it worked.

>>> matt@thebasement.com 07/28/04 04:51PM >>>

Howdy,

I am new to Lucene and thus far I am very impressed.  Thanks to all who have
worked on this project!

I am working on a project where I want to do the following:

1.) Index a bunch of document.
2.) Pluck out one of the doucments by Lucene document number
3.) Get a term frequency for that document

After some digging and playing I came across this method...

   IndexReader.getTermFreqVector(int docNumber, String field)

This is exactly what I want.  So I ran the IndexFiles demo program with some
test documents and started poking at the index with an IndexReader. But when I
called

   IndexReader.getTermFreqVector(someDocNumber,"contents")

I get NULL back.  After a little more digging I find that for a TermVector to
exist the Field has to have the TermVector flag set.  So I changes some lines
in the demo FileDocument.Document method to:

    FileInputStream is = new FileInputStream(f);
    Reader reader = new BufferedReader(new InputStreamReader(is));
    doc.add(Field.Text("contents", reader.toString(),true));

with the "true" parameter causing the new Field to turn on the storeTermVector
flag, right? So then I reindex and get the same results - getTermFreqVector
returns NULL.  So I inspect the field list of the Document from the index:

   Document d = ir.document(td.doc());
   System.out.println("  Path: "+d.get("path"));
   for (Enumeration e = d.fields() ; e.hasMoreElements() ;) 
   {
      System.out.println(((Field)e.nextElement()).toString());
   }

and I discover that there is now NO "contents" Field.  If I change the paramter
in Field.Text to false, I get a "contents" Field but no TermVector.  To date I
haven't been able to figure out how to get a TermFreqVector at all.

What am I missing?

I have looked at the documents - all the tutorials I have found just cover the
basics.

I have read the news group postings related to "TermVectors" and
"TermFreqVectors" and everybody says stuff like "the new 1.4 Vector stuff is
great".  So how do they know?  Where can I learn about this?  Are there any more
complete user tutorials/references that cover TermVector features?

Oh, I am using the 1.4 Lucene release in case it matters.

Thanks in advance,

Matt Galloway
Tulsa, Oklahoma


(BTW, I also tired Field.UnStored with the same results.)



-------------------------------------------------
This mail sent through IMP: http://horde.org/imp/ 

----- End forwarded message -----




-------------------------------------------------
This mail sent through IMP: http://horde.org/imp/ 

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org 
For additional commands, e-mail: lucene-user-help@jakarta.apache.org 



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: TermFreqVector Beginner Question

Posted by Daniel Naber <da...@t-online.de>.
On Thursday 29 July 2004 17:31, Matt Galloway wrote:

>  Field.Text(String name, Reader value, boolean storeTermVector)
>    Field.UnStored(String name, String value, boolean storeTermVector)
>
>    DO NOT store the contents of the field

This part of the API is known to be difficult and will be fixed for Lucene 2.0 
(which is the next version). Till then, I'll try to remember to extend the 
documentation.

Regards
 Daniel


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: TermFreqVector Beginner Question

Posted by Matt Galloway <ma...@thebasement.com>.
Well, as one would expect most of the problems were me.  Here is what I
learned... (please comment on the accuracy of these statements).

1.) Setting storeTermVertor to true does nothing if store is false, i.e. 
   you must store the contents of a filed in order to retrieve TermVectors 
   for it later.  This may seem obvious to everyone else, but to a new user 
   this is anything but obvious as it is not documented anywhere that I have
   seen.  I think it would be very helpful to include this tidbit in the 
   Field class JavaDoc if not in the FAQ or some other place.

   I also think it would be helpful to prevent the user from combinations 
   of store and storeTermVector that don't make sense, namely store = false 
   and storeTermVector = true.  Maybe an exception or something.

2.) The following methods...

   Field.Text(String name, Reader value, boolean storeTermVector) 
   Field.UnStored(String name, String value, boolean storeTermVector) 

   DO NOT store the contents of the field and (based on my assumption in 
   point 1 and through observation) consequently DO NOT store TermVectors
   despite the value of their storeTermVector value.  If this is accurate,
   why do these methods exist?  This is very misleading to the new user.

3.) I am also new to Java so if you look at my earlier sample code you
   will see that I used "reader.toString()" where reader is a buffered file
   reader.  This of course is not the desired effect.  I have since rewritten
   the code to reflect the a string that contains the content of the file
   instead of some vector address thing.  This doesn't affect Lucene or
   term vectors, just my ego.

Once you understand that stor=true is ALSO a prerequisite for TermVectors (in
addition to storeTermVector=true) then everything works great.

Thanks for the help,

Matt Galloway


Quoting Grant Ingersoll <gs...@syr.edu>:

> Can you post the whole section of related code?  Sounds like you are doing
> things right.  
> 
> In the Lucene source code, there is a file called TestTermVectors.java, take
> a look at that and see how your stuff compares.  I ran the test against the
> HEAD and it worked.
> 
> >>> matt@thebasement.com 07/28/04 04:51PM >>>
> 
> Howdy,
> 
> I am new to Lucene and thus far I am very impressed.  Thanks to all who
> have
> worked on this project!
> 
> I am working on a project where I want to do the following:
> 
> 1.) Index a bunch of document.
> 2.) Pluck out one of the doucments by Lucene document number
> 3.) Get a term frequency for that document
> 
> After some digging and playing I came across this method...
> 
>    IndexReader.getTermFreqVector(int docNumber, String field)
> 
> This is exactly what I want.  So I ran the IndexFiles demo program with
> some
> test documents and started poking at the index with an IndexReader. But when
> I
> called
> 
>    IndexReader.getTermFreqVector(someDocNumber,"contents")
> 
> I get NULL back.  After a little more digging I find that for a TermVector
> to
> exist the Field has to have the TermVector flag set.  So I changes some
> lines
> in the demo FileDocument.Document method to:
> 
>     FileInputStream is = new FileInputStream(f);
>     Reader reader = new BufferedReader(new InputStreamReader(is));
>     doc.add(Field.Text("contents", reader.toString(),true));
> 
> with the "true" parameter causing the new Field to turn on the
> storeTermVector
> flag, right? So then I reindex and get the same results - getTermFreqVector
> returns NULL.  So I inspect the field list of the Document from the index:
> 
>    Document d = ir.document(td.doc());
>    System.out.println("  Path: "+d.get("path"));
>    for (Enumeration e = d.fields() ; e.hasMoreElements() ;) 
>    {
>       System.out.println(((Field)e.nextElement()).toString());
>    }
> 
> and I discover that there is now NO "contents" Field.  If I change the
> paramter
> in Field.Text to false, I get a "contents" Field but no TermVector.  To date
> I
> haven't been able to figure out how to get a TermFreqVector at all.
> 
> What am I missing?
> 
> I have looked at the documents - all the tutorials I have found just cover
> the
> basics.
> 
> I have read the news group postings related to "TermVectors" and
> "TermFreqVectors" and everybody says stuff like "the new 1.4 Vector stuff
> is
> great".  So how do they know?  Where can I learn about this?  Are there any
> more
> complete user tutorials/references that cover TermVector features?
> 
> Oh, I am using the 1.4 Lucene release in case it matters.
> 
> Thanks in advance,
> 
> Matt Galloway
> Tulsa, Oklahoma
> 
> 
> (BTW, I also tired Field.UnStored with the same results.)

-------------------------------------------------
This mail sent through IMP: http://horde.org/imp/

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org