Posted to dev@lucene.apache.org by Chuck Williams <ch...@manawiz.com> on 2006/06/09 19:50:42 UTC

Prefix and general wildcards

Hi all,

I need to support query expressions like *xyz and possibly *lmn*.  The
former can be done with high search efficiency by storing (delimited)
reversed tokens and the latter by storing all (delimited) rotations for
each token.  However, both of these techniques have high index overhead,
the rotations being considerably worse than just the reversals.  In
principle, nothing is needed for the reversed or rotated tokens other
than the tokens themselves as their position and term vector information
is the same as the base token.
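
A minimal sketch of the two techniques in plain Java may make the trade-off
concrete. It assumes '$' as the delimiter and uses a sorted in-memory term set
as a stand-in for the index's term dictionary; the class and method names are
hypothetical and no Lucene API is involved. A suffix query such as *xyz becomes
a prefix lookup on the delimited reversed tokens, and an infix query such as
*lmn* becomes a prefix lookup on the rotations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class WildcardTokens {
    // Assumed delimiter; it must never occur inside a real token or query.
    static final char DELIM = '$';

    // Delimited reversed token: "abxyz" -> "$zyxba".
    // The leading delimiter keeps reversed terms apart from ordinary terms.
    static String reversed(String token) {
        return DELIM + new StringBuilder(token).reverse().toString();
    }

    // All rotations of token + DELIM: one extra term per character, plus one.
    static List<String> rotations(String token) {
        String s = token + DELIM;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(s.substring(i) + s.substring(0, i));
        }
        return out;
    }

    // Prefix lookup on a sorted term set: the first term >= prefix
    // starts with the prefix if and only if some term does.
    static boolean anyStartsWith(SortedSet<String> terms, String prefix) {
        SortedSet<String> tail = terms.tailSet(prefix);
        return !tail.isEmpty() && tail.first().startsWith(prefix);
    }

    // *xyz: reverse the suffix and look it up as a prefix of reversed terms.
    static boolean matchesSuffix(SortedSet<String> reversedTerms, String suffix) {
        return anyStartsWith(reversedTerms, reversed(suffix));
    }

    // *lmn*: look the infix up as a prefix of the rotated terms.
    static boolean matchesInfix(SortedSet<String> rotatedTerms, String infix) {
        return anyStartsWith(rotatedTerms, infix);
    }

    public static void main(String[] args) {
        SortedSet<String> reversedTerms = new TreeSet<>();
        SortedSet<String> rotatedTerms = new TreeSet<>();
        for (String token : new String[] {"abxyz", "almnb", "mnxl"}) {
            reversedTerms.add(reversed(token));
            rotatedTerms.addAll(rotations(token));
        }
        System.out.println(rotations("abc"));                    // [abc$, bc$a, c$ab, $abc]
        System.out.println(matchesSuffix(reversedTerms, "xyz")); // true
        System.out.println(matchesInfix(rotatedTerms, "lmn"));   // true
        System.out.println(matchesInfix(rotatedTerms, "zzz"));   // false
    }
}
```

The index overhead is visible in the sketch: reversal adds one extra term per
token, while rotation adds one extra term per character of the token plus one
for the delimiter. The delimiter also prevents false infix matches that would
otherwise wrap around the end of a token (for example "mnxl" does not match
*lmn*).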

Have others found a better solution for this?

If not, it occurs to me that one simple and substantial optimization is
to support a token filter for term vectors, i.e. pass tokens through an
additional filter for addition to term vectors.  Unless there is a
better solution, I'll post such a patch.

Thanks for any advice,

Chuck


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Prefix and general wildcards

Posted by Chuck Williams <ch...@manawiz.com>.

Doug Cutting wrote on 06/09/2006 11:00 AM:
> Why not instead add the rotated and/or reversed tokens to a different
> field that does not store vectors?

That would be a better idea.  Thanks!

Chuck



Re: Prefix and general wildcards

Posted by Chuck Williams <ch...@manawiz.com>.

Doug Cutting wrote on 06/09/2006 08:00 AM:
> Chuck Williams wrote:
>> one simple and substantial optimization is
>> to support a token filter for term vectors, i.e. pass tokens through an
>> additional filter for addition to term vectors.
>
> Why not instead add the rotated and/or reversed tokens to a different
> field that does not store vectors?
>
I'm running into issues with the separate field approach.  This would
seem to require either rereading the content or storing all of the
reversed/rotated tokens for subsequent generation out of a data
structure.  Both of these are performance problems, and in my app
rereading is not even practical.  Some fields are entire large
documents; requirements prohibit any truncation.  The content is
streamed to the indexer through SOAP, whence the additional rereading
problems.

It seems easiest and most efficient to have an additional filter on the
tokens that go into a term vector.  Am I missing an easier way to set up
a separate field?

I understand the desire to not add facilities to Lucene when there is an
existing method to achieve the same end, but it is not clear that using
an additional field is a practical approach.  It also seems that in
general the tokens useful in a term vector are only a subset of those
useful in the index -- at least this is the case for my app.

Thanks for any guidance,

Chuck


Re: Prefix and general wildcards

Posted by Doug Cutting <cu...@apache.org>.

Chuck Williams wrote:
> If not, it occurs to me that one simple and substantial optimization is
> to support a token filter for term vectors, i.e. pass tokens through an
> additional filter for addition to term vectors.  Unless there is a
> better solution, I'll post such a patch.

Why not instead add the rotated and/or reversed tokens to a different 
field that does not store vectors?

Doug
