Posted to solr-user@lucene.apache.org by Paul Dlug <pa...@gmail.com> on 2010/07/22 20:27:55 UTC
Providing token variants at index time
Is there a tokenizer that supports providing variants of the tokens at
index time? I'm looking for something that could take a syntax like:
International|I Business|B Machines|M
It would take each pipe-delimited token and preserve its position
so that phrase queries work properly. The above would match phrase
queries for "International Business Machines" as well as "I B M" or
any variants. The point is that the variants would be generated
externally as part of the indexing process, so they may not be as
simple as the above.
Any ideas, or do I have to write a custom tokenizer to do this?
Thanks,
Paul
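For illustration, the requested behaviour can be modelled outside Solr. The sketch below is not a Lucene tokenizer; it only shows the position bookkeeping, assuming whitespace separates groups and a pipe separates variants within a group. The function name `tokenize_variants` is hypothetical. Each token carries a position increment, the mechanism Lucene uses to stack several terms on one position: 1 moves to the next position, 0 stays on the current one.

```python
from typing import List, Tuple


def tokenize_variants(text: str, sep: str = "|") -> List[Tuple[str, int]]:
    """Return (term, position_increment) pairs for a variant-annotated string.

    Assumption: groups are whitespace-separated and variants inside a group
    are separated by `sep`. The first variant of a group advances the
    position by 1; every further variant gets increment 0, so it shares
    the position of the first.
    """
    tokens = []
    for group in text.split():
        variants = [v for v in group.split(sep) if v]  # drop empty pieces
        for i, term in enumerate(variants):
            tokens.append((term, 1 if i == 0 else 0))
    return tokens


tokens = tokenize_variants("International|I Business|B Machines|M")
for term, increment in tokens:
    print(term, increment)
```

A real implementation would emit the same increments through Lucene's token attributes rather than returning tuples.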
Re: Providing token variants at index time
Posted by Jonathan Rochkind <ro...@jhu.edu>.
Paul Dlug wrote:
> On Thu, Jul 22, 2010 at 4:01 PM, Jonathan Rochkind <ro...@jhu.edu> wrote:
>
>
> The synonym approach won't work, since it requires the variants to be
> provided in a file. The variants may be more dynamic and not known in
> advance; the process creating the documents to index does have that
> logic and could easily put them into the document in a format a
> tokenizer could pull apart later.
Then maybe look at the source code of the synonym filter, and build your
own filter, copying the parts that do the real work (or even
sub-classing), but instead of using a file, using the transient state
information that is for some reason only available at indexing time?
I don't entirely understand your use case; if you give some more explicit
examples, others might have other ideas.
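As a rough sketch of that idea (a filter that gets its variants from a callable instead of a synonyms file), the following models a token stream as (term, position_increment) pairs. This is not the synonym filter's actual code; `variant_filter` and the `initials` lookup are hypothetical names chosen for the example.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Token = Tuple[str, int]  # (term, position_increment)


def variant_filter(
    stream: Iterable[Token],
    lookup: Callable[[str], List[str]],
) -> Iterator[Token]:
    """Pass each token through, then stack its variants on the same position."""
    for term, increment in stream:
        yield (term, increment)
        for variant in lookup(term):
            # Increment 0 keeps the variant at the original token's position.
            yield (variant, 0)


def initials(term: str) -> List[str]:
    """Hypothetical lookup: use the first letter as the only variant."""
    return [term[0]] if len(term) > 1 else []


stream = [("International", 1), ("Business", 1), ("Machines", 1)]
stacked = list(variant_filter(stream, initials))
print(stacked)
```

The lookup could be any function with access to whatever state exists at indexing time; the filter itself does not care where the variants come from.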
Jonathan
Re: Providing token variants at index time
Posted by Paul Dlug <pa...@gmail.com>.
On Thu, Jul 22, 2010 at 4:01 PM, Jonathan Rochkind <ro...@jhu.edu> wrote:
> I think the Synonym filter should actually do exactly what you want, no?
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
>
> Hmm, maybe not exactly what you want as you describe it. It comes close,
> maybe good enough. Do you REALLY need to support "I Business M" or "I B
> Machines" as source/query? Your spec suggests yes, and the synonym filter
> won't easily do that. But if you just want "International Business
> Machines" == "IBM", keeping positions intact for subsequent terms, I think
> the synonym filter will do it.
> If not, I suppose you could look at its source to write your own. Or maybe
> there's some way to combine the PositionFilter with something else to do it,
> but I can't figure one out.
The synonym approach won't work, since it requires the variants to be
provided in a file. The variants may be more dynamic and not known in
advance; the process creating the documents to index does have that
logic and could easily put them into the document in a format a
tokenizer could pull apart later.
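The producing side of that arrangement might look like the sketch below: the indexing process joins each term with its externally generated variants into the pipe-delimited form from the original question. The function name `encode_variants` and the dictionary of variants are assumptions made for the example; the only format taken from the thread is `term|variant` groups separated by spaces.

```python
from typing import Dict, List


def encode_variants(
    terms: List[str],
    variants: Dict[str, List[str]],
    sep: str = "|",
) -> str:
    """Build a field value such as 'International|I Business|B Machines|M'."""
    groups = []
    for term in terms:
        alternatives = [term] + variants.get(term, [])
        for alt in alternatives:
            # The separator and whitespace are structural in this format,
            # so they cannot appear inside a term or variant.
            if not alt or sep in alt or any(ch.isspace() for ch in alt):
                raise ValueError(f"cannot encode {alt!r}")
        groups.append(sep.join(alternatives))
    return " ".join(groups)


field_value = encode_variants(
    ["International", "Business", "Machines"],
    {"International": ["I"], "Business": ["B"], "Machines": ["M"]},
)
print(field_value)
```

Rejecting terms that contain the separator keeps the format unambiguous; an escaping scheme would be an alternative if such terms can occur.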
--Paul
Re: Providing token variants at index time
Posted by Jonathan Rochkind <ro...@jhu.edu>.
I think the Synonym filter should actually do exactly what you want, no?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
Hmm, maybe not exactly what you want as you describe it. It comes close,
maybe good enough. Do you REALLY need to support "I Business M" or "I B
Machines" as source/query? Your spec suggests yes, and the synonym filter
won't easily do that. But if you just want "International Business
Machines" == "IBM", keeping positions intact for subsequent terms, I think
the synonym filter will do it.
If not, I suppose you could look at its source to write your own. Or
maybe there's some way to combine the PositionFilter with something else
to do it, but I can't figure one out.
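To make the mixed-variant case concrete, here is a small model of exact phrase matching over stacked positions. It is a simplification written for this example, not Lucene's phrase query implementation; `build_positions` and `phrase_matches` are hypothetical helpers. It assumes each word variant sits at the same position as its word, which is what the pipe syntax in the question asks for.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_positions(tokens: List[Tuple[str, int]]) -> Dict[str, Set[int]]:
    """Map each term to its absolute positions from (term, increment) pairs."""
    positions: Dict[str, Set[int]] = defaultdict(set)
    pos = -1
    for term, increment in tokens:
        pos += increment
        positions[term].add(pos)
    return positions


def phrase_matches(positions: Dict[str, Set[int]], phrase: List[str]) -> bool:
    """True if the phrase terms occur at consecutive positions."""
    if not phrase:
        return False
    return any(
        all(start + offset in positions.get(term, ()) for offset, term in enumerate(phrase))
        for start in positions.get(phrase[0], ())
    )


tokens = [
    ("International", 1), ("I", 0),
    ("Business", 1), ("B", 0),
    ("Machines", 1), ("M", 0),
]
index = build_positions(tokens)
print(phrase_matches(index, ["I", "Business", "M"]))
print(phrase_matches(index, ["Business", "International"]))
```

With per-word stacking, any combination of full words and variants lines up at consecutive positions, which is why phrases like "I Business M" would match under this scheme.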
Jonathan
Paul Dlug wrote:
> Is there a tokenizer that supports providing variants of the tokens at
> index time? I'm looking for something that could take a syntax like:
>
> International|I Business|B Machines|M
>
> It would take each pipe-delimited token and preserve its position
> so that phrase queries work properly. The above would match phrase
> queries for "International Business Machines" as well as "I B M" or
> any variants. The point is that the variants would be generated
> externally as part of the indexing process, so they may not be as
> simple as the above.
>
> Any ideas, or do I have to write a custom tokenizer to do this?
>
>
> Thanks,
> Paul