Posted to dev@opennlp.apache.org by Damiano Porta <da...@gmail.com> on 2015/09/13 20:59:56 UTC

How to handle big dictionaries to find typos

Hello,

I have created a very big dictionary of companies, it is around 3M.
At the moment I am using the DictionaryNameFinder class, but I need to
implement something to find typos like Gogle/Gooogle Inc etc.
I read something about Levenshtein distance; is this implemented in
OpenNLP?
It seems good, but I read that it takes a lot of time when there are many
words (my case).

What should I do?
Thanks!
Damiano

Re: How to handle big dictionaries to find typos

Posted by Damiano Porta <da...@gmail.com>.
Yes Catalin, I was using DictionaryNameFinder for NER. Unfortunately, it
does not support misspellings at the moment, so I have to migrate that
dictionary to a Lucene index.

Thank you!

2015-09-14 14:46 GMT+02:00 Cătălin M. <ca...@gmail.com>:

> Yes, you are right. You can replace DictionaryNameFinder with a Lucene
> index. When you mentioned DictionaryNameFinder I was thinking of the named
> entity recognition module (tagging being done using a NER model).
>
> Sorry for this misunderstanding.
>
> BR,
> Catalin

Re: How to handle big dictionaries to find typos

Posted by "Cătălin M." <ca...@gmail.com>.
Yes, you are right. You can replace DictionaryNameFinder with a Lucene index.
When you mentioned DictionaryNameFinder I was thinking of the named entity
recognition module (tagging being done using a NER model).

Sorry for this misunderstanding.

BR,
Catalin

On 09/14/2015 03:31 PM, Damiano Porta wrote:
> Hi Catalin,
> thank you so much for your help.
>
> Yes, I found Lucene's FuzzyQuery, but I did not understand one point. When
> I check a term (with typos) against a Lucene index to find the correct
> form, why do I have to use DictionaryNameFinder? I mean:
>
> 1. I can create an index with all the correct names.
> 2. I check each token against that index to find a match or a word within
> a specific "distance".
> 3. If I find something, I "tag" that word as a city without using
> DictionaryNameFinder.
>
> I mean, my "dictionary" will be this Lucene index.
> Correct?
>
> Thank you!
> Damiano


Re: How to handle big dictionaries to find typos

Posted by Damiano Porta <da...@gmail.com>.
Hi Catalin,
thank you so much for your help.

Yes, I found Lucene's FuzzyQuery, but I did not understand one point. When
I check a term (with typos) against a Lucene index to find the correct
form, why do I have to use DictionaryNameFinder? I mean:

1. I can create an index with all the correct names.
2. I check each token against that index to find a match or a word within
a specific "distance".
3. If I find something, I "tag" that word as a city without using
DictionaryNameFinder.

I mean, my "dictionary" will be this Lucene index.
Correct?
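
The three steps above could be sketched roughly as follows. This is only a
plain-Java stand-in, not Lucene: the class FuzzyDictionaryTagger and its
methods lookup and tagTokens are hypothetical names, the "index" is an
in-memory map bucketed by name length, and matching is case-sensitive and
limited to single tokens (a name such as "Google Inc" would need extra
handling). In a real setup a Lucene index queried with FuzzyQuery would
take the place of the map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory stand-in for the index; not part of OpenNLP or Lucene. */
final class FuzzyDictionaryTagger {

    // Step 1: the "index" of correct names, bucketed by length so that a lookup
    // only compares against names whose length differs by at most maxDistance.
    private final Map<Integer, List<String>> namesByLength = new HashMap<>();
    private final int maxDistance;

    FuzzyDictionaryTagger(List<String> correctNames, int maxDistance) {
        this.maxDistance = maxDistance;
        for (String name : correctNames) {
            namesByLength.computeIfAbsent(name.length(), k -> new ArrayList<>()).add(name);
        }
    }

    // Step 2: check one token against the index and return the closest
    // name within maxDistance edits, if there is one.
    Optional<String> lookup(String token) {
        String best = null;
        int bestDistance = maxDistance + 1;
        int minLen = Math.max(0, token.length() - maxDistance);
        int maxLen = token.length() + maxDistance;
        for (int len = minLen; len <= maxLen; len++) {
            for (String name : namesByLength.getOrDefault(len, Collections.emptyList())) {
                int d = distance(token, name);
                if (d < bestDistance) {
                    best = name;
                    bestDistance = d;
                }
            }
        }
        return Optional.ofNullable(best);
    }

    // Step 3: "tag" every token that matched, as token position -> correct name.
    Map<Integer, String> tagTokens(String[] tokens) {
        Map<Integer, String> tags = new LinkedHashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            final int position = i;
            lookup(tokens[i]).ifPresent(name -> tags.put(position, name));
        }
        return tags;
    }

    // Levenshtein distance with two rolling rows (insert, delete, substitute cost 1 each).
    private static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With the names "Google" and "Apple" and a maximum distance of 1, calling
tagTokens on the tokens "I", "use", "Gogle" maps position 2 to "Google".
The length buckets only reduce the number of comparisons; each lookup is
still a scan over the candidate buckets, so with millions of names a
dedicated index would be the safer choice.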

Thank you!
Damiano



2015-09-14 13:10 GMT+02:00 Cătălin M. <ca...@gmail.com>:

> A solution might be to check typos (Gogle, Gooogle) against a Lucene index
> that would contain your dictionary of companies, too. Using the FuzzyQuery
> you would find the correct form => "Google" and then use this correct form
> in your DictionaryNameFinder.
>
> Please let me know if it seems feasible.
>
> BR,
> Catalin

Re: How to handle big dictionaries to find typos

Posted by "Cătălin M." <ca...@gmail.com>.
A solution might be to check typos (Gogle, Gooogle) against a Lucene index
that would also contain your dictionary of companies. Using the FuzzyQuery
you would find the correct form => "Google" and then use this correct form
in your DictionaryNameFinder.

Please let me know if it seems feasible.

BR,
Catalin


On 09/13/2015 10:35 PM, Damiano Porta wrote:
> Hi Catalin,
> Can I use it with DictionaryNameFinder?
> Thanks
> Damiano


Re: How to handle big dictionaries to find typos

Posted by Damiano Porta <da...@gmail.com>.
Hi Catalin,
Can I use it with DictionaryNameFinder?
Thanks
Damiano

On Sun, 13 Sep 2015 at 21:08, Catalin Mititelu <ca...@gmail.com>
wrote:

> Hi Damiano,
>
> You may try Lucene's fuzzy query, which is based on Levenshtein distance.
>
> BR,
> Catalin

Re: How to handle big dictionaries to find typos

Posted by Catalin Mititelu <ca...@gmail.com>.
Hi Damiano,

You may try Lucene's fuzzy query, which is based on Levenshtein distance.
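
For reference, a minimal sketch of what that distance measures: the number
of single-character insertions, deletions and substitutions needed to turn
one string into another. The class name Levenshtein is made up for
illustration, and this is the textbook dynamic-programming version, not
Lucene's own implementation.

```java
/** Textbook Levenshtein edit distance; an illustrative helper, not a Lucene class. */
final class Levenshtein {

    private Levenshtein() {}

    static int distance(String a, String b) {
        // prev[j] is the distance between the first i-1 chars of a and the first j chars of b.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building b's first j chars from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing a's first i chars to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // match or substitution
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("Gogle", "Google"));   // prints 1 (one missing 'o')
        System.out.println(distance("Gooogle", "Google")); // prints 1 (one extra 'o')
    }
}
```

One comparison costs time proportional to the product of the two string
lengths, so comparing a token against every entry of a very large
dictionary this way gets slow; that is the reason to let an index select
the candidates.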

BR,
Catalin

On 09/13/2015 09:59 PM, Damiano Porta wrote:
> Hello,
>
> I have created a very big dictionary of companies, it is around 3M.
> At the moment I am using the DictionaryNameFinder class, but I need to
> implement something to find typos like Gogle/Gooogle Inc etc.
> I read something about Levenshtein distance; is this implemented in
> OpenNLP?
> It seems good, but I read that it takes a lot of time when there are many
> words (my case).
>
> What should I do?
> Thanks!
> Damiano
>