Posted to java-user@lucene.apache.org by Hannes Carl Meyer <de...@rc.ag> on 2006/04/26 20:27:22 UTC
Dealing with acronyms
Hi All,
I would like to enable users to do an acronym search on my index.
My idea is the following:
1.) Extract acronyms (ABS, ESP, VCG etc.) from the given document (which
is going to be indexed)
2.) Store the extracted acronyms in a field, for example called "case"
3.) On search, ask the user to use case:"ABS" to search for acronyms
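The three steps above can be sketched in plain Java. This is only an illustration, not Lucene code: the class name, the "contents" field name, and the rule that an acronym is a standalone run of 2 to 5 uppercase ASCII letters are all assumptions made for the example. A real implementation would add the "case" value as a field on the Lucene document.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AcronymFieldSketch {
    // Assumption: an acronym is a standalone run of 2 to 5 uppercase ASCII letters.
    private static final Pattern ACRONYM = Pattern.compile("\\b[A-Z]{2,5}\\b");

    /** Step 1: collect the distinct acronyms in document order. */
    static Set<String> extractAcronyms(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = ACRONYM.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    /** Step 2: build a field map with the acronyms in their own "case" field. */
    static Map<String, String> toFields(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("contents", text); // hypothetical name for the main text field
        fields.put("case", String.join(" ", extractAcronyms(text)));
        return fields;
    }

    /** Step 3: emulate a case:"ABS" lookup against the dedicated field. */
    static boolean matchesCase(Map<String, String> fields, String acronym) {
        return Arrays.asList(fields.get("case").split(" ")).contains(acronym);
    }

    public static void main(String[] args) {
        Map<String, String> doc = toFields("The car has ABS and ESP. ABS is standard.");
        System.out.println(doc.get("case"));
        System.out.println(matchesCase(doc, "ABS"));
    }
}
```

The uppercase-only rule will also pick up shouted words such as "STOP", so it is a starting point rather than a reliable extractor.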
Any experience with this kind of pattern? Other ideas or best practices?
Thank you in advance and best regards
Hannes
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Dealing with acronyms
Posted by Stefan Will <st...@gmx.net>.
This makes perfect sense to me. Of course the hard part will be how to
extract the acronyms.
-- Stefan
Re: Dealing with acronyms
Posted by Rajesh Munavalli <fi...@gmail.com>.
> So I guess it's done by writing or extending an analyzer?
Yes, that's correct.
--Rajesh Munavalli
Blog: http://munavalli.blogspot.com
Re: Dealing with acronyms
Posted by Hannes Carl Meyer <de...@rc.ag>.
Hi,
Thank you, that's good advice. I don't have the "Lucene in Action" book,
but I think it's worth taking a look at.
So I guess it's done by writing or extending an analyzer?
H.
Re: Dealing with acronyms
Posted by Rajesh Munavalli <fi...@gmail.com>.
On 4/26/06, Hannes Carl Meyer <de...@rc.ag> wrote:
>
> Hi All,
>
> I would like to enable users to do an acronym search on my index.
> My idea is the following:
>
> 1.) Extract acronyms (ABS, ESP, VCG etc.) from the given document (which
> is going to be indexed)
In case you haven't already looked at it, you might find this useful:
http://www.cs.waikato.ac.nz/~nzdl/publications/1999/Yeates-Auto-Extract.pdf
> 2.) Store the extracted acronyms in a field, for example called "case"
>
> 3.) On search, ask the user to use case:"ABS" to search for acronyms
I would rather store them in the same field as the other terms, so that you
can do phrase queries. Store the acronyms just as you would store synonyms.
More information on how to store synonyms is in the "Lucene in Action" book.
This would facilitate queries like "USA President". If you store "USA" in a
separate field, you wouldn't be able to match this query.
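To illustrate why the field matters for phrase queries, here is a toy positional index in plain Java. It is not Lucene code: the class, the method names, and the "contents" field name are made up for the example, and it holds a single document. A phrase matches only when both terms sit at adjacent positions within the same field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class SameFieldPhraseSketch {
    // field name -> term -> token positions (toy index for one document)
    private final Map<String, Map<String, List<Integer>>> index = new HashMap<>();

    /** Tokenize on whitespace, lowercase, and record each term's positions. */
    void add(String field, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\s+");
        Map<String, List<Integer>> postings =
                index.computeIfAbsent(field, f -> new HashMap<>());
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
    }

    /** True if 'second' directly follows 'first' within the given field. */
    boolean matchesPhrase(String field, String first, String second) {
        Map<String, List<Integer>> postings =
                index.getOrDefault(field, Collections.emptyMap());
        List<Integer> a = postings.getOrDefault(
                first.toLowerCase(Locale.ROOT), Collections.emptyList());
        List<Integer> b = postings.getOrDefault(
                second.toLowerCase(Locale.ROOT), Collections.emptyList());
        for (int pos : a) {
            if (b.contains(pos + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SameFieldPhraseSketch idx = new SameFieldPhraseSketch();
        idx.add("contents", "The USA President spoke");
        idx.add("case", "USA");
        System.out.println(idx.matchesPhrase("contents", "USA", "President"));
        System.out.println(idx.matchesPhrase("case", "USA", "President"));
    }
}
```

In the "case" field, "USA" has no neighbouring "President" token, so a phrase lookup there cannot succeed, while the same lookup against the main field can.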
> Any experience with this kind of pattern? Other ideas or best practices?
I would also look at HMMs/CRFs to extract acronyms. You need to come up with
a list of features to identify a potential acronym. For example:
- All caps
- The acronym appears repeatedly in the rest of the text
- Found in an acronym dictionary, etc.
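The features listed above could be computed roughly as follows. This is a plain Java sketch of feature extraction only, not an HMM or CRF: a sequence model would consume features like these. The class name, the 2 to 6 letter length limit, and the dictionary passed in by the caller are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AcronymFeatureSketch {
    // Feature "all caps": candidates are standalone runs of 2 to 6 uppercase letters.
    private static final Pattern ALL_CAPS = Pattern.compile("\\b[A-Z]{2,6}\\b");

    /** Feature values for one candidate acronym. */
    static final class Features {
        final int occurrences;
        final boolean repeated;      // appears more than once in the text
        final boolean inDictionary;  // listed in the supplied acronym dictionary

        Features(int occurrences, boolean inDictionary) {
            this.occurrences = occurrences;
            this.repeated = occurrences > 1;
            this.inDictionary = inDictionary;
        }
    }

    /** Maps each all-caps candidate to its features, in order of first appearance. */
    static Map<String, Features> candidateFeatures(String text, Set<String> dictionary) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = ALL_CAPS.matcher(text);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        Map<String, Features> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            result.put(e.getKey(),
                    new Features(e.getValue(), dictionary.contains(e.getKey())));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Features> f = candidateFeatures(
                "The ABS and ESP systems. ABS prevents wheel lock.", Set.of("ABS"));
        f.forEach((term, feat) -> System.out.println(
                term + " repeated=" + feat.repeated + " inDictionary=" + feat.inDictionary));
    }
}
```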
Hope this helps,
--Rajesh Munavalli
Blog: http://munavalli.blogspot.com