Posted to dev@lucene.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2008/08/27 16:47:30 UTC

Analyzer and Fieldable, different stored and indexed values

Hi all,

I recently had a situation where I had to pass some metadata to the 
Analyzer. This metadata was specific to a Document instance (the short 
story is that the analysis of some fields depended on data coming from 
other fields, and the number of possible values was too large to use a 
separate field for each combination).

It would be nice to have an Analyzer.tokenStream(String fieldName, Field 
f), or even better tokenStream(String fieldName, Document doc) ... but 
it's probably too intrusive to change this. I would be happy, though, 
with tokenStream(String, Fieldable), because then I could provide my own 
Fieldable with metadata.

In the meantime, having neither option, I came up with an idea: use a 
subclass of Reader, attach my metadata there, and then pass this Reader 
when creating a Field. However, I quickly discovered that if you set a 
Reader on a Field, the field automatically becomes un-stored - not what 
I wanted ... and Field is declared final, so no luck subclassing it either.

In the end I implemented my own Fieldable, which sort of breaks the 
contract for Fieldable - but it works :) . Namely, my Fieldable returns 
non-null values from both readerValue() and stringValue(): the first 
method returns my subclass of Reader with the metadata attached, and the 
second returns the value to be stored.

The reason this works is that DocInverterPerField first checks the 
tokenStreamValue, then the readerValue, and only then the stringValue, 
which it wraps in a Reader - so in my case it uses the supplied 
readerValue. At the same time, FieldsWriter, which is responsible for 
storing field values, uses just the stringValue (or binaryValue, but 
that wasn't relevant in my case), which is also set to non-null.
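
In other words, the effective order of checks during inversion is 
roughly this (a paraphrase, not the actual DocInverterPerField code):

   TokenStream stream = field.tokenStreamValue();
   if (stream == null) {
     Reader reader = field.readerValue();  // my metadata Reader wins here
     if (reader == null) {
       reader = new StringReader(field.stringValue());
     }
     stream = analyzer.tokenStream(field.name(), reader);
   }
   // FieldsWriter, independently, stores field.stringValue().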

So, here are my thoughts on this - I'd appreciate any comments:

* Is this a justified use of the API? It works, at least at the moment 
;) and I couldn't find any other way to accomplish the task.

* Could we perhaps relax the restriction on Fieldable so that it may 
return non-null values from more than one method, and clearly document 
the sequence in which they are processed? This is already hinted at in 
the javadoc.

* I propose to add a new method to Analyzer:

   public TokenStream tokenStream(String fieldName, Fieldable field);

to support use cases like the one I described above. The default 
implementation could be something like this:

   public TokenStream tokenStream(String fieldName, Fieldable field) {
     // Prefer an explicit Reader; otherwise fall back to wrapping the
     // string value, just like the indexing chain does today.
     Reader r = field.readerValue();
     if (r == null) {
       r = new StringReader(field.stringValue());
     }
     return tokenStream(fieldName, r);
   }
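
A custom analyzer could then override the new method to get at the 
metadata, e.g. (MetadataField and getMetadata() are the made-up names 
from the sketch above, and this assumes the default implementation 
proposed here):

   public class MetadataAwareAnalyzer extends Analyzer {
     public TokenStream tokenStream(String fieldName, Reader reader) {
       return new WhitespaceTokenizer(reader);
     }

     public TokenStream tokenStream(String fieldName, Fieldable field) {
       // The proposed default implementation hands us the Reader-based
       // stream; we then pick the filter chain per document.
       TokenStream ts = super.tokenStream(fieldName, field);
       if (field instanceof MetadataField) {
         if ("stem".equals(((MetadataField) field).getMetadata())) {
           ts = new PorterStemFilter(ts);
         }
       }
       return ts;
     }
   }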


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Analyzer and Fieldable, different stored and indexed values

Posted by Andrzej Bialecki <ab...@getopt.org>.
Grant Ingersoll wrote:
> If I'm understanding correctly...
> 
> What about a SinkTokenizer that is backed by a Reader/Field instead of 
> the current one that stores it all in a List?  This is more or less the 
> use case for the Tee/Sink implementations, with the exception that we 
> didn't plan for the Sink being too large - but that is easily overcome, 
> IMO.
> 
> That is, you use a TeeTokenFilter that adds to your Sink, which 
> serializes to some storage, and then your SinkTokenizer just 
> deserializes.  No need to change Fieldable or anything else.
> 
> Or maybe just a Tokenizer that is backed by a Field would work, using a 
> TermEnum on the Field to serve up next() for the TokenStream.
> 
> Just thinking out loud...

Actually, the scenario is more complicated, because I need to implement 
this as a Solr FieldType ... besides, wouldn't this mean that I couldn't 
store the original value, since I'd be setting the tokenStream on a Field 
(which automatically makes it un-stored)?
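
(That is, as far as I can tell the TokenStream-based Field constructor 
always yields an un-stored field - something like:

   Field f = new Field("body", myTokenStream);  // indexed, but not stored

- please correct me if I'm misremembering the constructor.)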

Anyway, thanks for the hint, I'll check whether I can do it this way. 
As for the other points about the new Analyzer API - I still think it 
would offer more flexibility than the current API, for a minimal cost in 
compatibility and likely no cost in performance.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Analyzer and Fieldable, different stored and indexed values

Posted by Grant Ingersoll <gs...@apache.org>.
If I'm understanding correctly...

What about a SinkTokenizer that is backed by a Reader/Field instead of 
the current one that stores it all in a List?  This is more or less the 
use case for the Tee/Sink implementations, with the exception that we 
didn't plan for the Sink being too large - but that is easily overcome, 
IMO.

That is, you use a TeeTokenFilter that adds to your Sink, which 
serializes to some storage, and then your SinkTokenizer just 
deserializes.  No need to change Fieldable or anything else.
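
Roughly like this (from memory of the current Tee/Sink API, so take the 
exact calls with a grain of salt; the serialization step is left out):

   SinkTokenizer sink = new SinkTokenizer();
   TokenStream tee = new TeeTokenFilter(
       new WhitespaceTokenizer(new StringReader(text)), sink);

   // Consuming the tee fills the sink as a side effect.
   while (tee.next() != null) { /* feed the primary consumer */ }

   // The sink then replays the buffered tokens for a second consumer;
   // a Reader/Field-backed sink would deserialize them here instead.
   Token t;
   while ((t = sink.next()) != null) {
     // ...
   }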

Or maybe just a Tokenizer that is backed by a Field would work, using a 
TermEnum on the Field to serve up next() for the TokenStream.

Just thinking out loud...

-Grant

On Aug 27, 2008, at 10:47 AM, Andrzej Bialecki wrote:

> [...]