You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Juan Caicedo <ju...@cavorite.com> on 2019/07/06 16:35:02 UTC

New feature idea - Backwards (FST) dictionary for approximate string search

Hello,

I've been working on a project for extending LevenshteinAutomata and
I'd like to know if it would be useful to add it to Lucene.

I've implemented the 'backwards dictionary' technique (see [1],
section 6) for speeding up approximate search. This technique allows
us to narrow down the search and, therefore, reduce the running time
(at the expense of using more memory).

I implemented it quite some time ago using an older version of Lucene,
so I need to revisit the code. However, the implementation was
relatively simple and it didn't require major changes to the core
classes. I can share the code in a public repository and iterate on
it, while I make it compatible for new Lucene APIs, add benchmarks,
and more unit tests.

Ideally, I'd like to contribute to Lucene, either as part of core,
suggest or a different module.

What do you think?

[1] https://www.cis.uni-muenchen.de/download/publikationen/fastapproxsearch.pdf

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: New feature idea - Backwards (FST) dictionary for approximate string search

Posted by Michael Sokolov <ms...@gmail.com>.

You could create an alternate field in which you index the reverse tokens,
using a reverse token filter appended to the analysis chain of the main
field.

That leaves the question of how do you manage that in a service context,
and what kind of query makes use of two fields?

Otherwise you have to make a lower level change of some sort. I'm not sure
where you would do that cleanly.

On Wed, Jul 10, 2019, 7:40 AM Juan Caicedo <ju...@cavorite.com>
wrote:

> I didn’t integrate it directly with a Lucene index. I used the FST class
> (and related utilities) to build a stand-alone spellchecker.
>
> However, the FST for the term dictionary is modified at index time only.
> At query time, we need to create a variation of the Levemshtein automata
> and match it with the FST accordingly.
>
> I need to take a closer look at the Lucene code to see how can we
> integrate it with the existing modules. I’ll do that this week.
>
> On Tue, Jul 9, 2019 at 03:18 Michael McCandless <lu...@mikemccandless.com>
> wrote:
>
>> This sounds interesting!
>>
>> Did your patch build the FST at index time, so at search time all that
>> was needed was to load the FST from the index directory?
>>
>> Is the FST per-segment?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Tue, Jul 9, 2019 at 1:43 AM Juan Caicedo <ju...@cavorite.com>
>> wrote:
>>
>>> Hi Michael,
>>>
>>> I guess that I should have added more details :-)
>>>
>>> The main benefit from the technique is to do approximate search in
>>> large dictionaries. That is: find all the entries that are within 1 or
>>> 2 edit steps from the query. It's more useful for longer (starting
>>> from ~7 characters, iirc), but in general the technique can be applied
>>> to queries of any length. The main requirement is that we build the
>>> dictionary as an FST, using the backwards (reversed) keys of the
>>> original dictionary.
>>>
>>> I initially used the technique to implement a stand-alone
>>> spellchecker, but I think that it can also be used to optimize the
>>> Lucene fuzzy queries (e.g. for the spelling/suggest module). However,
>>> I'll need to look how can it be integrated with the part that creates
>>> the dictionary.
>>>
>>> I'll take a look at the code this week and I'll try to publish it in a
>>> public repository so that we can discuss about this with more concrete
>>> details.
>>>
>>> I skimmed the paper trying to understand possible applications of the
>>> technique. It sounds like efficient approximate (ie with some edits)
>>> substring search is the main idea? I don't believe such a query exists
>>> today in Lucene (nor any Suggester as far as I know). It sounds as if
>>> this would be useful for searching within large strings, eg DNA
>>> sequences or something like that, and maybe less applicable to typical
>>> "full text" (ie tokenized) search where the strings being searched are
>>> relatively shorter - does that sound right?
>>>
>>> On Sat, Jul 6, 2019 at 2:01 PM Michael Sokolov <ms...@gmail.com>
>>> wrote:
>>> >
>>> > Juan, that sounds intriguing.
>>> >
>>> > I skimmed the paper trying to understand possible applications of the
>>> > technique. It sounds like efficient approximate (ie with some edits)
>>> > substring search is the main idea? I don't believe such a query exists
>>> > today in Lucene (nor any Suggester as far as I know). It sounds as if
>>> > this would be useful for searching within large strings, eg DNA
>>> > sequences or something like that, and maybe less applicable to typical
>>> > "full text" (ie tokenized) search where the strings being searched are
>>> > relatively shorter - does that sound right?
>>> >
>>> > On Sat, Jul 6, 2019 at 12:35 PM Juan Caicedo <
>>> juan.caicedo@cavorite.com> wrote:
>>> > >
>>> > > Hello,
>>> > >
>>> > > I've been working on a project for extending LevenshteinAutomata and
>>> > > I'd like to know if it would be useful to add it to Lucene.
>>> > >
>>> > > I've implemented the 'backwards dictionary' technique (see [1],
>>> > > section 6) for speeding up approximate search. This technique allows
>>> > > us to narrow down the search and, therefore, reduce the running time
>>> > > (at the expense of using more memory).
>>> > >
>>> > > I implemented it quite some time ago using an older version of
>>> Lucene,
>>> > > so I need to revisit the code. However, the implementation was
>>> > > relatively simple and it didn't require major changes to the core
>>> > > classes. I can share the code in a public repository and iterate on
>>> > > it, while I make it compatible for new Lucene APIs, add benchmarks,
>>> > > and more unit tests.
>>> > >
>>> > > Ideally, I'd like to contribute to Lucene, either as part of core,
>>> > > suggest or a different module.
>>> > >
>>> > > What do you think?
>>> > >
>>> > > [1]
>>> https://www.cis.uni-muenchen.de/download/publikationen/fastapproxsearch.pdf
>>> > >
>>> > > ---------------------------------------------------------------------
>>> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > > For additional commands, e-mail: dev-help@lucene.apache.org
>>> > >
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>

Re: New feature idea - Backwards (FST) dictionary for approximate string search

Posted by Juan Caicedo <ju...@cavorite.com>.

I didn’t integrate it directly with a Lucene index. I used the FST class
(and related utilities) to build a stand-alone spellchecker.

However, the FST for the term dictionary is modified at index time only. At
query time, we need to create a variation of the Levemshtein automata and
match it with the FST accordingly.

I need to take a closer look at the Lucene code to see how can we integrate
it with the existing modules. I’ll do that this week.

On Tue, Jul 9, 2019 at 03:18 Michael McCandless <lu...@mikemccandless.com>
wrote:

> This sounds interesting!
>
> Did your patch build the FST at index time, so at search time all that was
> needed was to load the FST from the index directory?
>
> Is the FST per-segment?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Jul 9, 2019 at 1:43 AM Juan Caicedo <ju...@cavorite.com>
> wrote:
>
>> Hi Michael,
>>
>> I guess that I should have added more details :-)
>>
>> The main benefit from the technique is to do approximate search in
>> large dictionaries. That is: find all the entries that are within 1 or
>> 2 edit steps from the query. It's more useful for longer (starting
>> from ~7 characters, iirc), but in general the technique can be applied
>> to queries of any length. The main requirement is that we build the
>> dictionary as an FST, using the backwards (reversed) keys of the
>> original dictionary.
>>
>> I initially used the technique to implement a stand-alone
>> spellchecker, but I think that it can also be used to optimize the
>> Lucene fuzzy queries (e.g. for the spelling/suggest module). However,
>> I'll need to look how can it be integrated with the part that creates
>> the dictionary.
>>
>> I'll take a look at the code this week and I'll try to publish it in a
>> public repository so that we can discuss about this with more concrete
>> details.
>>
>> I skimmed the paper trying to understand possible applications of the
>> technique. It sounds like efficient approximate (ie with some edits)
>> substring search is the main idea? I don't believe such a query exists
>> today in Lucene (nor any Suggester as far as I know). It sounds as if
>> this would be useful for searching within large strings, eg DNA
>> sequences or something like that, and maybe less applicable to typical
>> "full text" (ie tokenized) search where the strings being searched are
>> relatively shorter - does that sound right?
>>
>> On Sat, Jul 6, 2019 at 2:01 PM Michael Sokolov <ms...@gmail.com>
>> wrote:
>> >
>> > Juan, that sounds intriguing.
>> >
>> > I skimmed the paper trying to understand possible applications of the
>> > technique. It sounds like efficient approximate (ie with some edits)
>> > substring search is the main idea? I don't believe such a query exists
>> > today in Lucene (nor any Suggester as far as I know). It sounds as if
>> > this would be useful for searching within large strings, eg DNA
>> > sequences or something like that, and maybe less applicable to typical
>> > "full text" (ie tokenized) search where the strings being searched are
>> > relatively shorter - does that sound right?
>> >
>> > On Sat, Jul 6, 2019 at 12:35 PM Juan Caicedo <ju...@cavorite.com>
>> wrote:
>> > >
>> > > Hello,
>> > >
>> > > I've been working on a project for extending LevenshteinAutomata and
>> > > I'd like to know if it would be useful to add it to Lucene.
>> > >
>> > > I've implemented the 'backwards dictionary' technique (see [1],
>> > > section 6) for speeding up approximate search. This technique allows
>> > > us to narrow down the search and, therefore, reduce the running time
>> > > (at the expense of using more memory).
>> > >
>> > > I implemented it quite some time ago using an older version of Lucene,
>> > > so I need to revisit the code. However, the implementation was
>> > > relatively simple and it didn't require major changes to the core
>> > > classes. I can share the code in a public repository and iterate on
>> > > it, while I make it compatible for new Lucene APIs, add benchmarks,
>> > > and more unit tests.
>> > >
>> > > Ideally, I'd like to contribute to Lucene, either as part of core,
>> > > suggest or a different module.
>> > >
>> > > What do you think?
>> > >
>> > > [1]
>> https://www.cis.uni-muenchen.de/download/publikationen/fastapproxsearch.pdf
>> > >
>> > > ---------------------------------------------------------------------
>> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > > For additional commands, e-mail: dev-help@lucene.apache.org
>> > >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>

Re: New feature idea - Backwards (FST) dictionary for approximate string search

Posted by Michael McCandless <lu...@mikemccandless.com>.

This sounds interesting!

Did your patch build the FST at index time, so at search time all that was
needed was to load the FST from the index directory?

Is the FST per-segment?

Mike McCandless

http://blog.mikemccandless.com


On Tue, Jul 9, 2019 at 1:43 AM Juan Caicedo <ju...@cavorite.com>
wrote:

> Hi Michael,
>
> I guess that I should have added more details :-)
>
> The main benefit from the technique is to do approximate search in
> large dictionaries. That is: find all the entries that are within 1 or
> 2 edit steps from the query. It's more useful for longer (starting
> from ~7 characters, iirc), but in general the technique can be applied
> to queries of any length. The main requirement is that we build the
> dictionary as an FST, using the backwards (reversed) keys of the
> original dictionary.
>
> I initially used the technique to implement a stand-alone
> spellchecker, but I think that it can also be used to optimize the
> Lucene fuzzy queries (e.g. for the spelling/suggest module). However,
> I'll need to look how can it be integrated with the part that creates
> the dictionary.
>
> I'll take a look at the code this week and I'll try to publish it in a
> public repository so that we can discuss about this with more concrete
> details.
>
> I skimmed the paper trying to understand possible applications of the
> technique. It sounds like efficient approximate (ie with some edits)
> substring search is the main idea? I don't believe such a query exists
> today in Lucene (nor any Suggester as far as I know). It sounds as if
> this would be useful for searching within large strings, eg DNA
> sequences or something like that, and maybe less applicable to typical
> "full text" (ie tokenized) search where the strings being searched are
> relatively shorter - does that sound right?
>
> On Sat, Jul 6, 2019 at 2:01 PM Michael Sokolov <ms...@gmail.com> wrote:
> >
> > Juan, that sounds intriguing.
> >
> > I skimmed the paper trying to understand possible applications of the
> > technique. It sounds like efficient approximate (ie with some edits)
> > substring search is the main idea? I don't believe such a query exists
> > today in Lucene (nor any Suggester as far as I know). It sounds as if
> > this would be useful for searching within large strings, eg DNA
> > sequences or something like that, and maybe less applicable to typical
> > "full text" (ie tokenized) search where the strings being searched are
> > relatively shorter - does that sound right?
> >
> > On Sat, Jul 6, 2019 at 12:35 PM Juan Caicedo <ju...@cavorite.com>
> wrote:
> > >
> > > Hello,
> > >
> > > I've been working on a project for extending LevenshteinAutomata and
> > > I'd like to know if it would be useful to add it to Lucene.
> > >
> > > I've implemented the 'backwards dictionary' technique (see [1],
> > > section 6) for speeding up approximate search. This technique allows
> > > us to narrow down the search and, therefore, reduce the running time
> > > (at the expense of using more memory).
> > >
> > > I implemented it quite some time ago using an older version of Lucene,
> > > so I need to revisit the code. However, the implementation was
> > > relatively simple and it didn't require major changes to the core
> > > classes. I can share the code in a public repository and iterate on
> > > it, while I make it compatible for new Lucene APIs, add benchmarks,
> > > and more unit tests.
> > >
> > > Ideally, I'd like to contribute to Lucene, either as part of core,
> > > suggest or a different module.
> > >
> > > What do you think?
> > >
> > > [1]
> https://www.cis.uni-muenchen.de/download/publikationen/fastapproxsearch.pdf
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: dev-help@lucene.apache.org
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: New feature idea - Backwards (FST) dictionary for approximate string search

Posted by Juan Caicedo <ju...@cavorite.com>.

Hi Michael,

I guess that I should have added more details :-)

The main benefit from the technique is to do approximate search in
large dictionaries. That is: find all the entries that are within 1 or
2 edit steps from the query. It's more useful for longer (starting
from ~7 characters, iirc), but in general the technique can be applied
to queries of any length. The main requirement is that we build the
dictionary as an FST, using the backwards (reversed) keys of the
original dictionary.

I initially used the technique to implement a stand-alone
spellchecker, but I think that it can also be used to optimize the
Lucene fuzzy queries (e.g. for the spelling/suggest module). However,
I'll need to look how can it be integrated with the part that creates
the dictionary.

I'll take a look at the code this week and I'll try to publish it in a
public repository so that we can discuss about this with more concrete
details.

I skimmed the paper trying to understand possible applications of the
technique. It sounds like efficient approximate (ie with some edits)
substring search is the main idea? I don't believe such a query exists
today in Lucene (nor any Suggester as far as I know). It sounds as if
this would be useful for searching within large strings, eg DNA
sequences or something like that, and maybe less applicable to typical
"full text" (ie tokenized) search where the strings being searched are
relatively shorter - does that sound right?

On Sat, Jul 6, 2019 at 2:01 PM Michael Sokolov <ms...@gmail.com> wrote:
>
> Juan, that sounds intriguing.
>
> I skimmed the paper trying to understand possible applications of the
> technique. It sounds like efficient approximate (ie with some edits)
> substring search is the main idea? I don't believe such a query exists
> today in Lucene (nor any Suggester as far as I know). It sounds as if
> this would be useful for searching within large strings, eg DNA
> sequences or something like that, and maybe less applicable to typical
> "full text" (ie tokenized) search where the strings being searched are
> relatively shorter - does that sound right?
>
> On Sat, Jul 6, 2019 at 12:35 PM Juan Caicedo <ju...@cavorite.com> wrote:
> >
> > Hello,
> >
> > I've been working on a project for extending LevenshteinAutomata and
> > I'd like to know if it would be useful to add it to Lucene.
> >
> > I've implemented the 'backwards dictionary' technique (see [1],
> > section 6) for speeding up approximate search. This technique allows
> > us to narrow down the search and, therefore, reduce the running time
> > (at the expense of using more memory).
> >
> > I implemented it quite some time ago using an older version of Lucene,
> > so I need to revisit the code. However, the implementation was
> > relatively simple and it didn't require major changes to the core
> > classes. I can share the code in a public repository and iterate on
> > it, while I make it compatible for new Lucene APIs, add benchmarks,
> > and more unit tests.
> >
> > Ideally, I'd like to contribute to Lucene, either as part of core,
> > suggest or a different module.
> >
> > What do you think?
> >
> > [1] https://www.cis.uni-muenchen.de/download/publikationen/fastapproxsearch.pdf
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: New feature idea - Backwards (FST) dictionary for approximate string search

Posted by Michael Sokolov <ms...@gmail.com>.

Juan, that sounds intriguing.

I skimmed the paper trying to understand possible applications of the
technique. It sounds like efficient approximate (ie with some edits)
substring search is the main idea? I don't believe such a query exists
today in Lucene (nor any Suggester as far as I know). It sounds as if
this would be useful for searching within large strings, eg DNA
sequences or something like that, and maybe less applicable to typical
"full text" (ie tokenized) search where the strings being searched are
relatively shorter - does that sound right?

On Sat, Jul 6, 2019 at 12:35 PM Juan Caicedo <ju...@cavorite.com> wrote:
>
> Hello,
>
> I've been working on a project for extending LevenshteinAutomata and
> I'd like to know if it would be useful to add it to Lucene.
>
> I've implemented the 'backwards dictionary' technique (see [1],
> section 6) for speeding up approximate search. This technique allows
> us to narrow down the search and, therefore, reduce the running time
> (at the expense of using more memory).
>
> I implemented it quite some time ago using an older version of Lucene,
> so I need to revisit the code. However, the implementation was
> relatively simple and it didn't require major changes to the core
> classes. I can share the code in a public repository and iterate on
> it, while I make it compatible for new Lucene APIs, add benchmarks,
> and more unit tests.
>
> Ideally, I'd like to contribute to Lucene, either as part of core,
> suggest or a different module.
>
> What do you think?
>
> [1] https://www.cis.uni-muenchen.de/download/publikationen/fastapproxsearch.pdf
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org