Posted to java-user@lucene.apache.org by Hannes Carl Meyer <de...@rc.ag> on 2006/04/26 20:27:22 UTC

Dealing with acronyms

Hi All,

I would like to enable users to do an acronym search on my index.
My idea is the following:

1.) Extract acronyms (ABS, ESP, VCG etc.) from the given document (which 
is going to be indexed)

2.) Store the extracted acronyms in a field, for example called "case"

3.) On search, ask the user to use case:"ABS" to search for acronyms

Any experience with this kind of pattern? Other ideas or best practices?

Thank you in advance and best regards

Hannes

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Dealing with acronyms

Posted by Stefan Will <st...@gmx.net>.
This makes perfect sense to me. Of course the hard part will be how to 
extract the acronyms.

-- Stefan

Hannes Carl Meyer wrote:
> Hi All,
>
> I would like to enable users to do an acronym search on my index.
> My idea is the following:
>
> 1.) Extract acronyms (ABS, ESP, VCG etc.) from the given document 
> (which is going to be indexed)
>
> 2.) Store the extracted acronyms in a field, for example called "case"
>
> 3.) On search, ask the user to use case:"ABS" to search for acronyms
>
> Any experience with this kind of pattern? Other ideas or best practices?
>
> Thank you in advance and best regards
>
> Hannes


Re: Dealing with acronyms

Posted by Rajesh Munavalli <fi...@gmail.com>.
> So I guess it's done by writing or extending an analyzer?

Yes, that's correct.

--Rajesh Munavalli
Blog: http://munavalli.blogspot.com

Re: Dealing with acronyms

Posted by Hannes Carl Meyer <de...@rc.ag>.
Rajesh Munavalli wrote:
> On 4/26/06, Hannes Carl Meyer <de...@rc.ag> wrote:
>
>> Hi All,
>>
>> I would like to enable users to do an acronym search on my index.
>> My idea is the following:
>>
>> 1.) Extract acronyms (ABS, ESP, VCG etc.) from the given document (which
>> is going to be indexed)
>
> In case you haven't already looked at it, you might find this useful:
> http://www.cs.waikato.ac.nz/~nzdl/publications/1999/Yeates-Auto-Extract.pdf
>
>> 2.) Store the extracted acronyms in a field, for example called "case"
>>
>> 3.) On search, ask the user to use case:"ABS" to search for acronyms
>
> I would rather store them in the same field as the other terms, so that
> you can do phrase queries. Store the acronyms just like you would store
> synonyms. More information on how to store synonyms is in the book
> "Lucene in Action". This would facilitate queries like "USA President".
> If you store "USA" in a separate field, you wouldn't be able to match
> this query.
>
>> Any experience with this kind of pattern? Other ideas or best practices?
>
> I would also look at HMMs/CRFs to extract acronyms. You need to come up with
> a list of features to identify a potential acronym. For example:
> - All caps
> - The acronym appears repeatedly in the rest of the text
> - Found in an acronym dictionary, etc.
>
> Hope this helps,
>
> --Rajesh Munavalli
> Blog: http://munavalli.blogspot.com
>
Hi,

Thank you, that's good advice. I don't have the "Lucene in Action" book,
but I think it's worth taking a look at it.

So I guess it's done by writing or extending an analyzer?

H.



Re: Dealing with acronyms

Posted by Rajesh Munavalli <fi...@gmail.com>.
On 4/26/06, Hannes Carl Meyer <de...@rc.ag> wrote:
>
> Hi All,
>
> I would like to enable users to do an acronym search on my index.
> My idea is the following:
>
> 1.) Extract acronyms (ABS, ESP, VCG etc.) from the given document (which
> is going to be indexed)

In case you haven't already looked at it, you might find this useful:
http://www.cs.waikato.ac.nz/~nzdl/publications/1999/Yeates-Auto-Extract.pdf

> 2.) Store the extracted acronyms in a field, for example called "case"
>
> 3.) On search, ask the user to use case:"ABS" to search for acronyms


I would rather store them in the same field as the other terms, so that you can do
phrase queries. Store the acronyms just like you would store synonyms. More
information on how to store synonyms is in the book "Lucene in Action". This
would facilitate queries like "USA President". If you store "USA" in a
separate field, you wouldn't be able to match this query.
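
[Editorial sketch] The point about phrase queries can be illustrated with a toy positional index. This is plain Java and deliberately not Lucene's API; the class name SameFieldAcronymIndex, the whitespace tokenizer, and the "two or more upper-case letters" acronym rule are all assumptions made for the example. It reads the synonym suggestion in one possible way: each token is indexed in lower case, and an acronym is additionally indexed in its original case at the same position, so it can take part in phrase matching with its neighbours.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy positional index for ONE field; a stand-in for illustration, not Lucene's API.
final class SameFieldAcronymIndex {
    // term -> positions at which the term occurs in the field
    private final Map<String, List<Integer>> postings = new HashMap<>();

    private void add(String term, int position) {
        postings.computeIfAbsent(term, k -> new ArrayList<>()).add(position);
    }

    // Assumption: an acronym is a token made of two or more upper-case letters.
    private static boolean looksLikeAcronym(String token) {
        return token.length() >= 2 && token.chars().allMatch(Character::isUpperCase);
    }

    // Index every token in lower case; index acronyms a second time,
    // case-preserved, at the SAME position (the way a synonym would be stacked).
    void index(String text) {
        int position = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            add(token.toLowerCase(Locale.ROOT), position);
            if (looksLikeAcronym(token)) add(token, position);
            position++;
        }
    }

    // True if the terms occur at consecutive positions. Terms are looked up
    // exactly as given, so the caller decides the casing of each query term.
    boolean matchesPhrase(String... terms) {
        if (terms.length == 0) return false;
        for (int start : postings.getOrDefault(terms[0], List.of())) {
            boolean all = true;
            for (int i = 1; i < terms.length && all; i++) {
                all = postings.getOrDefault(terms[i], List.of()).contains(start + i);
            }
            if (all) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SameFieldAcronymIndex idx = new SameFieldAcronymIndex();
        idx.index("the USA President visited");
        System.out.println(idx.matchesPhrase("USA", "president")); // true
        System.out.println(idx.matchesPhrase("president", "USA")); // false
    }
}
```

With the acronym in a separate field there would be no shared position sequence, so a phrase spanning "USA" and "president" could not be checked this way.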

> Any experience with this kind of pattern? Other ideas or best practices?

I would also look at HMMs/CRFs to extract acronyms. You need to come up with
a list of features to identify a potential acronym. For example:
- All Caps
- The acronym appears repeatedly in the rest of the text
- Found in the acronym dictionary...etc
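
[Editorial sketch] The three features above can be computed with a few lines of plain Java. This is not an HMM or CRF; it swaps in a simple hand-weighted score just to show what the features look like as code. The class name AcronymCandidates, the 2 to 6 letter length bound, and the one-point-per-feature scoring are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AcronymCandidates {
    // Feature "all caps": a standalone run of 2 to 6 upper-case letters
    // (the length bounds are an assumption, not taken from the thread).
    private static final Pattern ALL_CAPS = Pattern.compile("\\b[A-Z]{2,6}\\b");

    // Returns candidate -> score, in order of first appearance.
    // Score: 1 for being all caps, +1 if it occurs more than once,
    // +1 if it is listed in the supplied acronym dictionary.
    static Map<String, Integer> score(String text, Set<String> dictionary) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = ALL_CAPS.matcher(text);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int s = 1;                                  // all caps
            if (e.getValue() > 1) s++;                  // repeated in the text
            if (dictionary.contains(e.getKey())) s++;   // found in the dictionary
            scores.put(e.getKey(), s);
        }
        return scores;
    }

    public static void main(String[] args) {
        String text = "The ABS and ESP systems were tested. ABS passed, but the VCG unit FAILED.";
        // prints {ABS=3, ESP=2, VCG=2, FAILED=1}
        System.out.println(score(text, Set.of("ABS", "ESP", "VCG")));
    }
}
```

A shouted word such as "FAILED" still gets the all-caps point, which is why the other two features (or a trained sequence model) are needed to separate real acronyms from emphasis.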

Hope this helps,

--Rajesh Munavalli
Blog: http://munavalli.blogspot.com