
Posted to java-user@lucene.apache.org by Shlomy Reinstein <sr...@gmail.com> on 2010/05/23 16:03:32 UTC

Searching for a strict prefix

Hi,

I have a Lucene index that contains source code tags (a tag can be any
named source code element - function, class, variable). Each document
contains a field with the tag name and some additional information.
I'd like to be able to perform strict prefix queries. E.g. if I have a
tag named "updateChildren", I'd like to be able to write a query like
"updateC*" and get the list of tags that have this prefix
(updateChildren being one of them).

Is there a way to do this? Please suggest how I should store the field
for this purpose - which analyzer to use, whether to keep it stored,
etc.

Thanks,
Shlomy

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Searching for a strict prefix

Posted by Shlomy Reinstein <sr...@gmail.com>.

Hi,

Thanks for the help.
BTW, if anyone is interested: I tried the same with KeywordAnalyzer -
added a field with value "m_szName", and tried to find it using
"FieldName:m_szN*" but failed. Someone in the Lucene IRC channel
showed me why - QueryParser, by default, lowercases all expanded terms
(e.g. those with '*' in them), so it actually looked for "m_szn*".
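
For anyone who wants to see the mismatch in isolation, here is a minimal sketch in plain Java. It is only a model of the behaviour described above, not Lucene code: the class ExpandedTermCase and its methods are hypothetical names, and a TreeSet stands in for the terms a KeywordAnalyzer-style field would hold (case preserved). As far as I know, the classic QueryParser of that era exposes setLowercaseExpandedTerms(boolean) to turn the lowercasing off; check the javadocs for your version before relying on it.

```java
import java.util.Locale;
import java.util.TreeSet;

// Model only: shows why a lowercased prefix misses a case-preserved term.
// ExpandedTermCase, parsePrefix and matches are hypothetical names.
class ExpandedTermCase {

    // Strip a trailing '*' and optionally lowercase the prefix,
    // mimicking a parser that lowercases expanded (wildcard/prefix) terms.
    static String parsePrefix(String queryTerm, boolean lowercaseExpandedTerms) {
        String prefix = queryTerm.endsWith("*")
                ? queryTerm.substring(0, queryTerm.length() - 1)
                : queryTerm;
        return lowercaseExpandedTerms ? prefix.toLowerCase(Locale.ROOT) : prefix;
    }

    // Case-sensitive check: does any indexed term start with the prefix?
    static boolean matches(TreeSet<String> terms, String prefix) {
        // The smallest term >= prefix carries the prefix if any term does.
        String candidate = terms.ceiling(prefix);
        return candidate != null && candidate.startsWith(prefix);
    }

    public static void main(String[] args) {
        TreeSet<String> asIndexed = new TreeSet<>();
        asIndexed.add("m_szName"); // stored verbatim, case preserved

        // Parser lowercases the prefix to "m_szn": no match.
        System.out.println(matches(asIndexed, parsePrefix("m_szN*", true)));  // false
        // Lowercasing switched off: the prefix stays "m_szN" and matches.
        System.out.println(matches(asIndexed, parsePrefix("m_szN*", false))); // true

        // Alternative fix: lowercase the term at index time as well.
        TreeSet<String> lowercased = new TreeSet<>();
        lowercased.add("m_szName".toLowerCase(Locale.ROOT));
        System.out.println(matches(lowercased, parsePrefix("m_szN*", true))); // true
    }
}
```

Either fix works as long as the index side and the query side agree on case. For case-insensitive prefix search, lowercasing on both sides is the one that fits.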

Thanks again,
Shlomy

On Mon, May 24, 2010 at 4:37 PM, Ian Lea <ia...@gmail.com> wrote:
> I bet it's that underscore in m_sz.  Different analyzers do different
> things with different punctuation characters.  I can never remember
> which does exactly what - it'll be in the javadocs or Lucene In Action
> or somewhere on the web.
>
> You can check what exactly has been indexed by using Luke - always a good idea.
>
> In this case you probably want an analyzer that just downcases the tag
> names.  Easy enough to build your own if there isn't an existing one
> that meets your needs.
>
>
> --
> Ian.

Re: Searching for a strict prefix

Posted by Ian Lea <ia...@gmail.com>.

I bet it's that underscore in m_sz.  Different analyzers do different
things with different punctuation characters.  I can never remember
which does exactly what - it'll be in the javadocs or Lucene In Action
or somewhere on the web.

You can check what exactly has been indexed by using Luke - always a good idea.

In this case you probably want an analyzer that just downcases the tag
names.  Easy enough to build your own if there isn't an existing one
that meets your needs.
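
To make the tokenization point concrete, here is a rough sketch in plain Java. It is not Lucene code and the names (TagTokenizers and its methods) are made up for illustration. It ASSUMES an analyzer that breaks on the underscore, which is only my guess above and should be confirmed with Luke, and compares it with one that keeps the whole tag as a single lowercased token.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Model only: two ways a tag name might be turned into index terms.
// TagTokenizers and its methods are hypothetical names.
class TagTokenizers {

    // ASSUMPTION: an analyzer that splits on anything that is not a letter
    // or digit (so it breaks on '_') and lowercases each piece.
    static List<String> splitOnPunctuation(String tag) {
        List<String> tokens = new ArrayList<>();
        for (String part : tag.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // The "just downcase the tag name" analyzer: one token, lowercased.
    static List<String> wholeTagLowercased(String tag) {
        return Collections.singletonList(tag.toLowerCase(Locale.ROOT));
    }

    // A prefix query succeeds only if a single token starts with the prefix.
    static boolean anyTokenHasPrefix(List<String> tokens, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String token : tokens) {
            if (token.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> split = splitOnPunctuation("m_szComment");
        List<String> whole = wholeTagLowercased("m_szComment");
        System.out.println(split); // [m, szcomment]
        System.out.println(whole); // [m_szcomment]
        System.out.println(anyTokenHasPrefix(split, "m_szC")); // false
        System.out.println(anyTokenHasPrefix(whole, "m_szC")); // true
    }
}
```

Under that assumption no single token begins with "m_szc", so the prefix query has nothing to match, while "updateC*" still works because "updateChildren" contains no punctuation to split on.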


--
Ian.


On Mon, May 24, 2010 at 1:14 PM, Shlomy Reinstein <sr...@gmail.com> wrote:
> Hi,
>
> Thanks. By "strict prefix", I meant a prefix of the name
> (case-insensitive). What you suggest ("tagname: updateC*") was the
> first thing I tried, but it happens to work only partially. In my
> case, I have a lot of names beginning with "m_sz<something>", e.g.
> "m_szComment", "m_szName". Trying a query like "tagname: m_szC*" will
> not find the "m_szComment" value. I just wrote the wrong example here.
>
> Shlomy

Re: Searching for a strict prefix

Posted by Shlomy Reinstein <sr...@gmail.com>.

Hi,

Thanks. By "strict prefix", I meant a prefix of the name
(case-insensitive). What you suggest ("tagname: updateC*") was the
first thing I tried, but it happens to work only partially. In my
case, I have a lot of names beginning with "m_sz<something>", e.g.
"m_szComment", "m_szName". Trying a query like "tagname: m_szC*" will
not find the "m_szComment" value. I just wrote the wrong example here.

Shlomy

On Mon, May 24, 2010 at 11:32 AM, Ian Lea <ia...@gmail.com> wrote:
> StandardAnalyzer should work fine, mark the field as indexed, no need
> to store it unless you want to retrieve it for display.
>
> Query via QueryParser using "tagname: updateC*" or programmatically via
> PrefixQuery.
>
>
> Although I'm not sure exactly what you mean by "strict prefix".  If
> you mean that the prefix should match exactly on case, punctuation
> etc. maybe use KeywordAnalyzer instead.
>
>
>
> --
> Ian.

Re: Searching for a strict prefix

Posted by Ian Lea <ia...@gmail.com>.

StandardAnalyzer should work fine. Mark the field as indexed; there is no need
to store it unless you want to retrieve it for display.

Query via QueryParser using "tagname: updateC*" or programmatically via
PrefixQuery.
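
As a rough illustration of what a prefix query does underneath, here is a minimal sketch in plain Java. It is a model, not Lucene's API: TagPrefixIndex and its methods are hypothetical names, and a sorted set stands in for the indexed terms of the tag-name field, on the assumption that each tag is indexed as one lowercased term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.NavigableSet;
import java.util.TreeSet;

// Model only: prefix lookup over a sorted term dictionary.
// TagPrefixIndex, add and withPrefix are hypothetical names.
class TagPrefixIndex {
    // Sorted, lowercased terms standing in for the indexed field.
    private final NavigableSet<String> terms = new TreeSet<>();

    // Index one tag name as a single lowercased term.
    void add(String tagName) {
        terms.add(tagName.toLowerCase(Locale.ROOT));
    }

    // Return every indexed term that starts with the prefix, ignoring case.
    List<String> withPrefix(String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        // tailSet starts at the first term >= p; matches are contiguous from there.
        for (String term : terms.tailSet(p, true)) {
            if (!term.startsWith(p)) {
                break; // sorted order: no later term can match either
            }
            hits.add(term);
        }
        return hits;
    }

    public static void main(String[] args) {
        TagPrefixIndex index = new TagPrefixIndex();
        index.add("updateChildren");
        index.add("updateCache");
        index.add("updateParent");
        index.add("removeChildren");
        System.out.println(index.withPrefix("updateC")); // [updatecache, updatechildren]
    }
}
```

The hits come back in their lowercased, indexed form. That is the reason to store the field as well if the original spelling is needed for display.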


Although I'm not sure exactly what you mean by "strict prefix".  If
you mean that the prefix should match exactly on case, punctuation
etc. maybe use KeywordAnalyzer instead.



--
Ian.


On Sun, May 23, 2010 at 3:03 PM, Shlomy Reinstein <sr...@gmail.com> wrote:
> Hi,
>
> I have a Lucene index that contains source code tags (a tag can be any
> named source code element - function, class, variable). Each document
> contains a field with the tag name and some additional information.
> I'd like to be able to perform strict prefix queries. E.g. if I have a
> tag named "updateChildren", I'd like to be able to write a query like
> "updateC*" and get the list of tags that have this prefix
> (updateChildren being one of them).
>
> Is there a way to do this? Please suggest how I should store the field
> for this purpose - which analyzer to use, whether to keep it stored,
> etc.
>
> Thanks,
> Shlomy