Posted to java-user@lucene.apache.org by et...@comcast.net on 2005/04/16 04:21:33 UTC
token type question
Hi,
I am working on a program to index and search chemical elements and compounds. Say I write an analyzer to pick out chemical terms, such as H2O. I noticed that I can specify a token's type. Can I construct a token as
new Token ("H2", start, end, "chem");
My question is:
How do I search for all the tokens of type "chem", such as H2O, O2, etc.? Is there a sample like this?
If this approach doesn't work, what's the best approach?
Thanks,
Ethan
Re: token type question
Posted by Paul Libbrecht <pa...@activemath.org>.
On 16 Apr 2005, at 08:31, Pierrick Brihaye wrote:
>> How do I search all the tokens with "chem" type token, such as H2O,
>> O2, etc? Any sample like this? If this approach doesn't work, what's
>> the best approach?
Nifty question... I'm working on indexing text with math formulae, so
there may be similarities!
> You may assign a type to the tokens, and then you may filter them
> according to their type *but* the index forgets this info since it
> stores *terms* (field/value pairs). [...]
> 1) use a dedicated field "chem" where only chemical content is allowed
> (filter out every token whose type is different from "chem")
> 2) manipulate your termText: "chem_H2"; do the same for your queries
> 3) play with the query rather than with the index content: filter out
> what is not chemical
So it really seems chem_H2 is the only choice, doesn't it?
What are your requirements or expectations?
- match a formula in the middle of a sentence?
- or simply match documents that contain both the sentence's words and
the formula? (in the latter case, I think solution 1 is valid)
- how would you do wildcards with formulae?
A related question, at least for me, is how to match a+(b+1) when the
query is X+Y, i.e. a subtree cut.
Does this occur in chemical formulae as well?
paul
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: token type question
Posted by Pierrick Brihaye <pi...@free.fr>.
ethandev@comcast.net wrote:
> I am working on a program to index and search chemical elements and compounds. Say I write an analyzer to pick out chemical terms, such as H2O. I noticed that I can specify a token's type. Can I construct a token as
> new Token ("H2", start, end, "chem");
>
> My question is:
> How do I search for all the tokens of type "chem", such as H2O, O2, etc.? Is there a sample like this?
>
> If this approach doesn't work, what's the best approach?
You may assign a type to the tokens, and then you may filter them
according to their type *but* the index forgets this info since it
stores *terms* (field/value pairs).
Compare:
http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Token.html
and
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/Term.html
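To make the difference concrete, here is a minimal sketch in plain Java. SketchToken and SketchTerm are hypothetical stand-ins written for this illustration, not the Lucene classes; the only point is that a (field, text) pair has no slot for the type, so the type is gone once tokens become terms.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenVsTermSketch {
    // Hypothetical stand-in for a token: text, offsets and a type label.
    static final class SketchToken {
        final String text;
        final int start;
        final int end;
        final String type;

        SketchToken(String text, int start, int end, String type) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Hypothetical stand-in for a term: just a field name and a text value.
    static final class SketchTerm {
        final String field;
        final String text;

        SketchTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public String toString() {
            return field + ":" + text;
        }
    }

    // Turning tokens into terms keeps only the field name and the text;
    // the offsets and the type label are dropped.
    static List<SketchTerm> toTerms(String field, List<SketchToken> tokens) {
        List<SketchTerm> terms = new ArrayList<>();
        for (SketchToken token : tokens) {
            terms.add(new SketchTerm(field, token.text));
        }
        return terms;
    }
}
```

After toTerms, a "chem" token and a plain word token in the same field are indistinguishable except by their text.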
Note, however, that each term's relative position (derived from the
Token's positionIncrement, default = 1) is also stored in the index;
this allows proximity searches.
So, what can be done?
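As a rough illustration of how increments turn into positions, here is a plain-Java sketch (no Lucene classes; PositionSketch and its methods are hypothetical names). It assumes counting starts so that a first token with increment 1 lands at position 0, and it uses a simple "at most N positions apart" test as a stand-in for a proximity search.

```java
import java.util.ArrayList;
import java.util.List;

final class PositionSketch {
    // Absolute positions as a running sum of position increments.
    // Assumption: the counter starts at -1, so a first increment of 1 gives position 0.
    static List<Integer> positions(int[] increments) {
        List<Integer> result = new ArrayList<>();
        int position = -1;
        for (int increment : increments) {
            position += increment;
            result.add(position);
        }
        return result;
    }

    // True if some occurrence of a and some occurrence of b
    // are at most maxDistance positions apart.
    static boolean near(String[] terms, int[] increments,
                        String a, String b, int maxDistance) {
        List<Integer> pos = positions(increments);
        for (int i = 0; i < terms.length; i++) {
            if (!terms[i].equals(a)) {
                continue;
            }
            for (int j = 0; j < terms.length; j++) {
                if (terms[j].equals(b)
                        && Math.abs(pos.get(i) - pos.get(j)) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With the default increment of 1, consecutive tokens simply get consecutive positions; a larger increment leaves a gap, which makes the neighbouring terms look further apart to a proximity test.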
1) use a dedicated field "chem" where only chemical content is allowed
(filter out every token whose type is different from "chem")
2) manipulate your termText: "chem_H2"; do the same for your queries
3) play with the query rather than with the index content: filter out
what is not chemical
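Options 1 and 2 above could look roughly like this. It is only a sketch in plain Java: Tok is a hypothetical class standing in for a typed token, and none of the method names come from Lucene. In a real analyzer the same logic would presumably live in a token filter.

```java
import java.util.ArrayList;
import java.util.List;

final class ChemOptionsSketch {
    // Hypothetical typed token, not the Lucene Token class.
    static final class Tok {
        final String text;
        final String type;

        Tok(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    // Option 1: keep only the "chem" tokens; their texts would be
    // indexed in a dedicated "chem" field.
    static List<String> chemFieldTerms(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        for (Tok token : tokens) {
            if ("chem".equals(token.type)) {
                out.add(token.text);
            }
        }
        return out;
    }

    // Option 2: encode the type in the term text itself;
    // non-chemical tokens pass through unchanged.
    static List<String> prefixedTerms(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        for (Tok token : tokens) {
            out.add("chem".equals(token.type) ? "chem_" + token.text : token.text);
        }
        return out;
    }

    // Option 2, query side: chemical query terms need the same
    // rewrite, otherwise they will not match the indexed terms.
    static String prefixQueryTerm(String chemicalTerm) {
        return "chem_" + chemicalTerm;
    }
}
```

With option 1, every term in the "chem" field is chemical by construction, so the field name itself answers the "which tokens are chem" question. With option 2, a prefix query on "chem_" should play the same role, at the cost of having to rewrite every chemical query term.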
There may be other solutions...
Cheers,
p.b.