You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Varun Dhussa <va...@mapmyindia.com> on 2009/06/18 13:12:23 UTC
Fuzzy search change
Hi,
I wrote on this a long time ago, but haven't followed it up. I just
finished a C++ implementation of a spell check module in my software. I
borrowed the idea from Xapian. It is to use a trigram index to filter
results, and then use Edit Distance on the filtered set. Would such a
solution be acceptable to the Lucene Community? The details of my
implementation are as follows:
1) QDBM data store hash map
2) Trigram tokenizer on the input string
3) Data store hash(key,value) = (trigram, keyword_id_list<kw1...kwN)
4) Use trigram tokenizer and match with the trigram index
5) Get the IDs within the input cutoff
6) Run Edit Distance on the list and return
In my tests on a Intel Core 2 Duo with 3 GB RAM and Windows XP 32 bit,
it runs in <0.5 sec with a keyword record count of about 1,000,000
records. This is at least 3-4 times less than the current search times
on Lucene.
Since the results can be put in a thread safe hash table structure, the
trigram search can be distributed over a thread pool also.
Does this seem like a workable suggestion to the community?
Regards
--
Varun Dhussa
Product Architect
CE InfoSystems (P) Ltd
http://www.mapmyindia.com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Fuzzy search change
Posted by eks dev <ek...@yahoo.co.uk>.
what would be the difference/benefit compared to standard lucene SpellChecker?
If I I am not wrong:
- Lucene SpellChecker uses standard lucene index as a storage for tokens instead of QDBM... meaning full inverted index with arbitrary N-grams length, with tf/idf/norms... not only HashMap<trigram, wordList>
- SC uses paradigm "give me N Best candidates (similarity)", not only "all above cutoff"... this Similarity depends (standard lucene Similarity) on N-Gram frequency, (one could even use some sexy norms to fine tune words...)...
If I've read your proposal correctly and did not miss something important, my suggestion would be to have a look at lucene SC (http://lucene.apache.org/java/2_3_2/api/contrib-spellchecker/org/apache/lucene/search/spell/SpellChecker.html) before you start
have fun,
eks
----- Original Message ----
> From: Michael McCandless <lu...@mikemccandless.com>
> To: java-dev@lucene.apache.org
> Sent: Thursday, 18 June, 2009 16:29:59
> Subject: Re: Fuzzy search change
>
> This would make an awesome addition to Lucene!
>
> This is similar to how Lucene's spellchecker identifies candidates, if
> I understand it right.
>
> Would you be able to port it to java?
>
> Mike
>
> On Thu, Jun 18, 2009 at 7:12 AM, Varun Dhussawrote:
> > Hi,
> >
> > I wrote on this a long time ago, but haven't followed it up. I just finished
> > a C++ implementation of a spell check module in my software. I borrowed the
> > idea from Xapian. It is to use a trigram index to filter results, and then
> > use Edit Distance on the filtered set. Would such a solution be acceptable
> > to the Lucene Community? The details of my implementation are as follows:
> >
> > 1) QDBM data store hash map
> > 2) Trigram tokenizer on the input string
> > 3) Data store hash(key,value) = (trigram, keyword_id_list
> > 4) Use trigram tokenizer and match with the trigram index
> > 5) Get the IDs within the input cutoff
> > 6) Run Edit Distance on the list and return
> >
> > In my tests on a Intel Core 2 Duo with 3 GB RAM and Windows XP 32 bit, it
> > runs in <0.5 sec with a keyword record count of about 1,000,000 records.
> > This is at least 3-4 times less than the current search times on Lucene.
> >
> > Since the results can be put in a thread safe hash table structure, the
> > trigram search can be distributed over a thread pool also.
> >
> > Does this seem like a workable suggestion to the community?
> >
> > Regards
> >
> > --
> > Varun Dhussa
> > Product Architect
> > CE InfoSystems (P) Ltd
> > http://www.mapmyindia.com
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Fuzzy search change
Posted by Varun Dhussa <va...@mapmyindia.com>.
Hi,
I can port the code to java. I do not know the Lucene file structures
etc. as of now. So if someone with experience on that to store trigrams
and index them is can work on that part, I can port the rest of the code.
Regards
Varun Dhussa
Product Architect
CE InfoSystems (P) Ltd
http://www.mapmyindia.com
Michael McCandless wrote:
> This would make an awesome addition to Lucene!
>
> This is similar to how Lucene's spellchecker identifies candidates, if
> I understand it right.
>
> Would you be able to port it to java?
>
> Mike
>
> On Thu, Jun 18, 2009 at 7:12 AM, Varun Dhussa<va...@mapmyindia.com> wrote:
>
>> Hi,
>>
>> I wrote on this a long time ago, but haven't followed it up. I just finished
>> a C++ implementation of a spell check module in my software. I borrowed the
>> idea from Xapian. It is to use a trigram index to filter results, and then
>> use Edit Distance on the filtered set. Would such a solution be acceptable
>> to the Lucene Community? The details of my implementation are as follows:
>>
>> 1) QDBM data store hash map
>> 2) Trigram tokenizer on the input string
>> 3) Data store hash(key,value) = (trigram, keyword_id_list<kw1...kwN)
>> 4) Use trigram tokenizer and match with the trigram index
>> 5) Get the IDs within the input cutoff
>> 6) Run Edit Distance on the list and return
>>
>> In my tests on a Intel Core 2 Duo with 3 GB RAM and Windows XP 32 bit, it
>> runs in <0.5 sec with a keyword record count of about 1,000,000 records.
>> This is at least 3-4 times less than the current search times on Lucene.
>>
>> Since the results can be put in a thread safe hash table structure, the
>> trigram search can be distributed over a thread pool also.
>>
>> Does this seem like a workable suggestion to the community?
>>
>> Regards
>>
>> --
>> Varun Dhussa
>> Product Architect
>> CE InfoSystems (P) Ltd
>> http://www.mapmyindia.com
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Fuzzy search change
Posted by Michael McCandless <lu...@mikemccandless.com>.
This would make an awesome addition to Lucene!
This is similar to how Lucene's spellchecker identifies candidates, if
I understand it right.
Would you be able to port it to java?
Mike
On Thu, Jun 18, 2009 at 7:12 AM, Varun Dhussa<va...@mapmyindia.com> wrote:
> Hi,
>
> I wrote on this a long time ago, but haven't followed it up. I just finished
> a C++ implementation of a spell check module in my software. I borrowed the
> idea from Xapian. It is to use a trigram index to filter results, and then
> use Edit Distance on the filtered set. Would such a solution be acceptable
> to the Lucene Community? The details of my implementation are as follows:
>
> 1) QDBM data store hash map
> 2) Trigram tokenizer on the input string
> 3) Data store hash(key,value) = (trigram, keyword_id_list<kw1...kwN)
> 4) Use trigram tokenizer and match with the trigram index
> 5) Get the IDs within the input cutoff
> 6) Run Edit Distance on the list and return
>
> In my tests on a Intel Core 2 Duo with 3 GB RAM and Windows XP 32 bit, it
> runs in <0.5 sec with a keyword record count of about 1,000,000 records.
> This is at least 3-4 times less than the current search times on Lucene.
>
> Since the results can be put in a thread safe hash table structure, the
> trigram search can be distributed over a thread pool also.
>
> Does this seem like a workable suggestion to the community?
>
> Regards
>
> --
> Varun Dhussa
> Product Architect
> CE InfoSystems (P) Ltd
> http://www.mapmyindia.com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org