You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Varun Dhussa <va...@mapmyindia.com> on 2009/06/18 13:12:23 UTC

Fuzzy search change

Hi,

I wrote on this a long time ago, but haven't followed it up. I just 
finished a C++ implementation of a spell check module in my software. I 
borrowed the idea from Xapian. It is to use a trigram index to filter 
results, and then use Edit Distance on the filtered set. Would such a 
solution be acceptable to the Lucene Community? The details of my 
implementation are as follows:

1) QDBM data store hash map
2) Trigram tokenizer on the input string
3) Data store hash(key,value) = (trigram, keyword_id_list<kw1...kwN)
4) Use trigram tokenizer and match with the trigram index
5) Get the IDs within the input cutoff
6) Run Edit Distance on the list and return

In my tests on a Intel Core 2 Duo with 3 GB RAM and Windows XP 32 bit, 
it runs in <0.5 sec with a keyword record count of about 1,000,000 
records. This is at least 3-4 times less than the current search times 
on Lucene.

Since the results can be put in a thread safe hash table structure, the 
trigram search can be distributed over a thread pool also.

Does this seem like a workable suggestion to the community?

Regards

-- 
Varun Dhussa
Product Architect
CE InfoSystems (P) Ltd
http://www.mapmyindia.com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Fuzzy search change

Posted by eks dev <ek...@yahoo.co.uk>.
what would be the difference/benefit compared to standard lucene SpellChecker? 

If I I am not wrong:
- Lucene SpellChecker uses standard lucene index  as a storage for tokens instead of QDBM... meaning full inverted index with arbitrary N-grams length, with tf/idf/norms... not only HashMap<trigram, wordList> 

- SC uses paradigm "give me  N Best candidates (similarity)", not only "all above cutoff"... this Similarity depends (standard lucene Similarity) on N-Gram frequency, (one could even use some sexy norms to fine tune words...)...  

If I've read your proposal correctly and did not miss something important, my suggestion would be to have a look at lucene SC (http://lucene.apache.org/java/2_3_2/api/contrib-spellchecker/org/apache/lucene/search/spell/SpellChecker.html) before you start 
 

have fun, 
eks



----- Original Message ----
> From: Michael McCandless <lu...@mikemccandless.com>
> To: java-dev@lucene.apache.org
> Sent: Thursday, 18 June, 2009 16:29:59
> Subject: Re: Fuzzy search change
> 
> This would make an awesome addition to Lucene!
> 
> This is similar to how Lucene's spellchecker identifies candidates, if
> I understand it right.
> 
> Would you be able to port it to java?
> 
> Mike
> 
> On Thu, Jun 18, 2009 at 7:12 AM, Varun Dhussawrote:
> > Hi,
> >
> > I wrote on this a long time ago, but haven't followed it up. I just finished
> > a C++ implementation of a spell check module in my software. I borrowed the
> > idea from Xapian. It is to use a trigram index to filter results, and then
> > use Edit Distance on the filtered set. Would such a solution be acceptable
> > to the Lucene Community? The details of my implementation are as follows:
> >
> > 1) QDBM data store hash map
> > 2) Trigram tokenizer on the input string
> > 3) Data store hash(key,value) = (trigram, keyword_id_list
> > 4) Use trigram tokenizer and match with the trigram index
> > 5) Get the IDs within the input cutoff
> > 6) Run Edit Distance on the list and return
> >
> > In my tests on a Intel Core 2 Duo with 3 GB RAM and Windows XP 32 bit, it
> > runs in <0.5 sec with a keyword record count of about 1,000,000 records.
> > This is at least 3-4 times less than the current search times on Lucene.
> >
> > Since the results can be put in a thread safe hash table structure, the
> > trigram search can be distributed over a thread pool also.
> >
> > Does this seem like a workable suggestion to the community?
> >
> > Regards
> >
> > --
> > Varun Dhussa
> > Product Architect
> > CE InfoSystems (P) Ltd
> > http://www.mapmyindia.com
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org



      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Fuzzy search change

Posted by Varun Dhussa <va...@mapmyindia.com>.
Hi,

I can port the code to java. I do not know the Lucene file structures
etc. as of now. So if someone with experience on that to store trigrams
and index them is can work on that part, I can port the rest of the code.

Regards

Varun Dhussa
Product Architect
CE InfoSystems (P) Ltd
http://www.mapmyindia.com



Michael McCandless wrote:
> This would make an awesome addition to Lucene!
>
> This is similar to how Lucene's spellchecker identifies candidates, if
> I understand it right.
>
> Would you be able to port it to java?
>
> Mike
>
> On Thu, Jun 18, 2009 at 7:12 AM, Varun Dhussa<va...@mapmyindia.com> wrote:
>   
>> Hi,
>>
>> I wrote on this a long time ago, but haven't followed it up. I just finished
>> a C++ implementation of a spell check module in my software. I borrowed the
>> idea from Xapian. It is to use a trigram index to filter results, and then
>> use Edit Distance on the filtered set. Would such a solution be acceptable
>> to the Lucene Community? The details of my implementation are as follows:
>>
>> 1) QDBM data store hash map
>> 2) Trigram tokenizer on the input string
>> 3) Data store hash(key,value) = (trigram, keyword_id_list<kw1...kwN)
>> 4) Use trigram tokenizer and match with the trigram index
>> 5) Get the IDs within the input cutoff
>> 6) Run Edit Distance on the list and return
>>
>> In my tests on a Intel Core 2 Duo with 3 GB RAM and Windows XP 32 bit, it
>> runs in <0.5 sec with a keyword record count of about 1,000,000 records.
>> This is at least 3-4 times less than the current search times on Lucene.
>>
>> Since the results can be put in a thread safe hash table structure, the
>> trigram search can be distributed over a thread pool also.
>>
>> Does this seem like a workable suggestion to the community?
>>
>> Regards
>>
>> --
>> Varun Dhussa
>> Product Architect
>> CE InfoSystems (P) Ltd
>> http://www.mapmyindia.com
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>>     
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Fuzzy search change

Posted by Michael McCandless <lu...@mikemccandless.com>.
This would make an awesome addition to Lucene!

This is similar to how Lucene's spellchecker identifies candidates, if
I understand it right.

Would you be able to port it to java?

Mike

On Thu, Jun 18, 2009 at 7:12 AM, Varun Dhussa<va...@mapmyindia.com> wrote:
> Hi,
>
> I wrote on this a long time ago, but haven't followed it up. I just finished
> a C++ implementation of a spell check module in my software. I borrowed the
> idea from Xapian. It is to use a trigram index to filter results, and then
> use Edit Distance on the filtered set. Would such a solution be acceptable
> to the Lucene Community? The details of my implementation are as follows:
>
> 1) QDBM data store hash map
> 2) Trigram tokenizer on the input string
> 3) Data store hash(key,value) = (trigram, keyword_id_list<kw1...kwN)
> 4) Use trigram tokenizer and match with the trigram index
> 5) Get the IDs within the input cutoff
> 6) Run Edit Distance on the list and return
>
> In my tests on a Intel Core 2 Duo with 3 GB RAM and Windows XP 32 bit, it
> runs in <0.5 sec with a keyword record count of about 1,000,000 records.
> This is at least 3-4 times less than the current search times on Lucene.
>
> Since the results can be put in a thread safe hash table structure, the
> trigram search can be distributed over a thread pool also.
>
> Does this seem like a workable suggestion to the community?
>
> Regards
>
> --
> Varun Dhussa
> Product Architect
> CE InfoSystems (P) Ltd
> http://www.mapmyindia.com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org