You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by ro...@earthlink.net on 2009/01/23 22:01:14 UTC

why would a Field *vanish* from a Document?

R 2.4

I am rather new to Lucene, and it is entirely likely that the problem is somehow mine -- but I have no clue how. 

I have an index (of files), all fine; I post keyword queries and get lists of documents, all fine. 

Now I want to query against something other than keywords, so (here I am in part trying to reuse someone else's code, since I hardly understand enough about Lucene to write even a simple search) I try adding a Field to my Document's, under some conditions, like this: 

<pseudo-code>
Document document = new Document();
if (someCondition()) { 
	TokenStream stream = makeMyTokenStream(); // has correct reset() method 
	Field field = new Field("_MARKUP_TAGS_contents", stream, Field.TermVector.WITH_OFFSETS);
	document.add(field);
}
// verify here that the token-stream is behaving correctly 
// verify here that the "_MARKUP_TAGS_contents" Field *IS* (YES!) added 

// now add the "typical" fields; nothing surprising here: 
document.add(makeField(LENGTH_FIELD, Integer.toString(_docLength)));
document.add(makeField(NAME_FIELD, _documentReference));
document.add(makeField(DATE_FIELD, new Date().toString()));

... 
	protected Field makeField(final String name, final String value) {
		return new Field(name, value, Field.Store.YES, Field.Index.NOT_ANALYZED);
	}

</pseudo-code>

I don't *think* there's any problem there, and the token-stream that I create seems to behave as expected. 

The query that I post for a test is one that is simple and is satisfied by most of the documents in my small corpus (of ten documents). 

BUT .. when I post that query, the documents that are returned NEVER contain the additional field added above under "some condition". Let me repeat: I have verified that the Field *IS* added in the pseudo-code above. 

Here is the code I use to post a query and retrieve Documents: 
<pseudo-code>
IndexReader _anIndexReader = IndexReader.open(getFileRepresentingIndexDirectory());
IndexSearcher _anIndexSearcher = new IndexSearcher(_anIndexReader );

TopFieldDocs tfd = _anIndexSearcher.search(someQuery, null, 1000, new Sort()); 
int length = tfd.totalHits; 

for (int i = 0; i < length; ++i) {
	int docIndex = tfd.scoreDocs[i].doc;
	Document doc = _anIndexReader.document(docIndex); 
	String name = doc.get(NAME_FIELD); // same as in first block of pseudo-code 

	System.err.println("The document " + name + " has these Fields:");
	for (final Object item : doc.getFields()) {
		Field field = (Field) item; 
		System.err.println("\t" + field.name());
	}
	// only ever has the "typical" Fields from first block of pseudo-code
}
</pseudo-code>


Thus it is clear that either the Lucene-index strips the extra Field (why?) or that the Documents that are returned are somehow newly-created and lack the extra field, but contain all the "typical" Fields (why?) 

Any ideas? 

thanks,
Paul 





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: why would a Field *vanish* from a Document?

Posted by Michael McCandless <lu...@mikemccandless.com>.
This (the sneaky "difference" between an indexed Document and a the  
newly-created-at-search-time Document) is a frequent confusion with  
Lucene.

The field needs to be marked as stored (Field.Store.YES) in order for  
it to appear in the retrieved document at search time.

But, TokenStream fields cannot be stored since Lucene can't regenerate  
the original string for that field.

Since you are storing the term vector, you could retrieve that using  
IndexReader.getTermFreqVector.

Mike

rolarenfan@earthlink.net wrote:

> R 2.4
>
> I am rather new to Lucene, and it is entirely likely that the  
> problem is somehow mine -- but I have no clue how.
>
> I have an index (of files), all fine; I post keyword queries and get  
> lists of documents, all fine.
>
> Now I want to query against something other than keywords, so (here  
> I am in part trying to reuse someone else's code, since I hardly  
> understand enough about Lucene to write even a simple search) I try  
> adding a Field to my Document's, under some conditions, like this:
>
> <pseudo-code>
> Document document = new Document();
> if (someCondition()) {
> 	TokenStream stream = makeMyTokenStream(); // has correct reset()  
> method
> 	Field field = new Field("_MARKUP_TAGS_contents", stream,  
> Field.TermVector.WITH_OFFSETS);
> 	document.add(field);
> }
> // verify here that the token-stream is behaving correctly
> // verify here that the "_MARKUP_TAGS_contents" Field *IS* (YES!)  
> added
>
> // now add the "typical" fields; nothing surprising here:
> document.add(makeField(LENGTH_FIELD, Integer.toString(_docLength)));
> document.add(makeField(NAME_FIELD, _documentReference));
> document.add(makeField(DATE_FIELD, new Date().toString()));
>
> ...
> 	protected Field makeField(final String name, final String value) {
> 		return new Field(name, value, Field.Store.YES,  
> Field.Index.NOT_ANALYZED);
> 	}
>
> </pseudo-code>
>
> I don't *think* there's any problem there, and the token-stream that  
> I create seems to behave as expected.
>
> The query that I post for a test is one that is simple and is  
> satisfied by most of the documents in my small corpus (of ten  
> documents).
>
> BUT .. when I post that query, the documents that are returned NEVER  
> contain the additional field added above under "some condition". Let  
> me repeat: I have verified that the Field *IS* added in the pseudo- 
> code above.
>
> Here is the code I use to post a query and retrieve Documents:
> <pseudo-code>
> IndexReader _anIndexReader =  
> IndexReader.open(getFileRepresentingIndexDirectory());
> IndexSearcher _anIndexSearcher = new IndexSearcher(_anIndexReader );
>
> TopFieldDocs tfd = _anIndexSearcher.search(someQuery, null, 1000,  
> new Sort());
> int length = tfd.totalHits;
>
> for (int i = 0; i < length; ++i) {
> 	int docIndex = tfd.scoreDocs[i].doc;
> 	Document doc = _anIndexReader.document(docIndex);
> 	String name = doc.get(NAME_FIELD); // same as in first block of  
> pseudo-code
>
> 	System.err.println("The document " + name + " has these Fields:");
> 	for (final Object item : doc.getFields()) {
> 		Field field = (Field) item;
> 		System.err.println("\t" + field.name());
> 	}
> 	// only ever has the "typical" Fields from first block of pseudo-code
> }
> </pseudo-code>
>
>
> Thus it is clear that either the Lucene-index strips the extra Field  
> (why?) or that the Documents that are returned are somehow newly- 
> created and lack the extra field, but contain all the "typical"  
> Fields (why?)
>
> Any ideas?
>
> thanks,
> Paul
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org