Posted to java-user@lucene.apache.org by House Less <ho...@yahoo.com> on 2009/06/08 02:14:17 UTC

Retrieving the term vectors of a document in Nutch

Hello everyone,

I am quite new to development with Nutch, so you must forgive my question if it is amateurish.

After reading some of Luke's source code, I found to my dismay that asking the IndexReader for the TermFreqVector of a document returned no vectors at all. A mailing list entry found via Google said that Nutch does not store the contents of a page in its Lucene indices. This makes sense.

I then read the Nutch source code and figured out that one could use NutchBean to reconstruct the parsed text of an indexed page.

However, this still left the nagging problem of retrieving the TermFreqVector for the parsed text of a page. I tried MoreLikeThis to retrieve the set of terms, but that did not work either; the result was simply empty. The MoreLikeThis source code suggests that it makes certain assumptions about the Lucene indices it accesses.

At the end of the day, I simply decided to reconstruct the term frequency vector of a page by referring to TermDocs in the IndexReader. This is not very efficient, since I have to repeat it for every page as I iterate over the documents in the Lucene index.
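
The reconstruction described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code actually used: the nested map "postings" (term -> document number -> frequency) is a hypothetical stand-in for the index, and the class and method names are invented. In the Lucene API of that period, the outer loop would correspond to the TermEnum returned by IndexReader.terms() and the inner loop to a TermDocs positioned on each term with seek(). Rather than rescanning the postings once per page, the sketch walks them a single time and fills in the vectors of all documents together, which is one possible way around the inefficiency.

```java
import java.util.Map;
import java.util.TreeMap;

public class TermVectorSketch {

    // Hypothetical helper: records that `term` occurs `freq` times in document `docId`.
    static void addPosting(Map<String, Map<Integer, Integer>> postings,
                           String term, int docId, int freq) {
        postings.computeIfAbsent(term, t -> new TreeMap<>()).put(docId, freq);
    }

    // Inverts term -> (docId -> freq) into docId -> (term -> freq)
    // in a single pass over the postings.
    static Map<Integer, Map<String, Integer>> buildDocumentVectors(
            Map<String, Map<Integer, Integer>> postings) {
        Map<Integer, Map<String, Integer>> vectors = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> termEntry : postings.entrySet()) {
            String term = termEntry.getKey();
            for (Map.Entry<Integer, Integer> docEntry : termEntry.getValue().entrySet()) {
                vectors.computeIfAbsent(docEntry.getKey(), d -> new TreeMap<>())
                       .put(term, docEntry.getValue());
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Toy postings (assumed data, not taken from a real Nutch index).
        Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
        addPosting(postings, "lucene", 0, 2);
        addPosting(postings, "lucene", 1, 1);
        addPosting(postings, "nutch", 1, 3);
        addPosting(postings, "vector", 0, 1);

        Map<Integer, Map<String, Integer>> vectors = buildDocumentVectors(postings);
        System.out.println(vectors.get(0)); // prints {lucene=2, vector=1}
        System.out.println(vectors.get(1)); // prints {lucene=1, nutch=3}
    }
}
```

The trade-off is memory: holding the vectors of every document at once may not be feasible for a large crawl, in which case the same loop could be run over batches of document numbers.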

Is it possible to retrieve a previously computed TermFreqVector[] for a document from Nutch's Lucene indices? Or is this impossible because Nutch does not store the TermFreqVector[] at all? Your insights on the matter will help.

House



      


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Retrieving the term vectors of a document in Nutch

Posted by Andrzej Bialecki <ab...@getopt.org>.
(moved to nutch-user)

House Less wrote:
> In retrospect, pardon my stupidity: surely it cannot be right that
> the term frequency vector for a page is not present within Nutch, for
> it needs this to compute the score for a page given a query. I would
> appreciate it if you would tell me where I may find it given a
> document number. Thank you.

This is not a silly question. Indeed, Lucene uses a term frequency
vector model when computing scores, but that doesn't necessarily mean
that a term frequency vector _per_ _document_ is explicitly stored ...
and in fact Nutch does not store this data by default. You would have to
modify the indexing plugins to add this information, and then extend the
Nutch API to be able to retrieve it via NutchBean.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Retrieving the term vectors of a document in Nutch

Posted by atcach <at...@gmail.com>.
Hi House
I had the same problem and tried the same solution, but I am getting an
empty TermDocs. How did you do it?
My code is:
TermDocs td = ir.termDocs();
// First I store them in temporary lists; I cannot put them in an array
// because I do not know the count yet
while (td.next()) {
    if (td.doc() == m_docNro) {
        tfs.add(td.toString());
        System.out.println("Document: " + td.toString());
        tfi.add(td.freq());
    }
}

----
It never enters the while loop.
Regards!

--
View this message in context: http://lucene.472066.n3.nabble.com/Retrieving-the-term-vectors-of-a-document-in-Nutch-tp560993p3647617.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: Retrieving the term vectors of a document in Nutch

Posted by House Less <ho...@yahoo.com>.
Hello Grant,



> I'd ask on the nutch-user@lucene.apache.org mailing list.  While Lucene can do 
> all of these things, it is not clear how Nutch exposes, if at all, any of this 
> information.  You should be able to get results there.

Thanks, I'll be sure to ask them.
 
> Note, however, that Term Vecs must be created during indexing by creating the 
> Field properly.  You could likely modify the Nutch code where it creates the 
> Lucene Document and Fields to add in Term Vector capabilities.

That might work. Thanks again for the pointers!



      



Re: Retrieving the term vectors of a document in Nutch

Posted by Grant Ingersoll <gs...@apache.org>.
I'd ask on the nutch-user@lucene.apache.org mailing list.  While  
Lucene can do all of these things, it is not clear how Nutch exposes,  
if at all, any of this information.  You should be able to get results  
there.

Note, however, that Term Vecs must be created during indexing by  
creating the Field properly.  You could likely modify the Nutch code  
where it creates the Lucene Document and Fields to add in Term Vector  
capabilities.

-Grant


On Jun 7, 2009, at 8:58 PM, House Less wrote:

>
> In retrospect, pardon my stupidity: surely it cannot be right that  
> the term frequency vector for a page is not present within Nutch,  
> for it needs this to compute the score for a page given a query. I  
> would appreciate it if you would tell me where I may find it given a  
> document number. Thank you.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search




Re: Retrieving the term vectors of a document in Nutch

Posted by House Less <ho...@yahoo.com>.
In retrospect, pardon my stupidity: surely it cannot be right that the term frequency vector for a page is not present within Nutch, for it needs this to compute the score for a page given a query. I would appreciate it if you would tell me where I may find it given a document number. Thank you.



      

