You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Olivier Binda <ol...@wanadoo.fr> on 2015/01/09 12:42:56 UTC

Finding a match for an automaton against a FST

Hello.

1) What is the best way to check if an automaton (from a regex or a 
string with a wildcard)
has at least 1 match against a FST (from a WFSTCompletionLookup) ?

Is there a simple way to do that ?


2) Also, is there a simple/efficient way to find the lowest and the 
highest arcs of a FST that match against  an automaton ?

Best regards,
Olivier



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Finding a match for an automaton against a FST

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Sat, Jan 10, 2015 at 8:23 AM, Olivier Binda <ol...@wanadoo.fr> wrote:
> On 01/10/2015 11:00 AM, Michael McCandless wrote:
>>
>> On Fri, Jan 9, 2015 at 6:42 AM, Olivier Binda <ol...@wanadoo.fr>
>> wrote:
>>>
>>> Hello.
>>>
>>> 1) What is the best way to check if an automaton (from a regex or a
>>> string
>>> with a wildcard)
>>> has at least 1 match against a FST (from a WFSTCompletionLookup) ?
>>
>> You need to implement "intersect".  We already have this method for
>> two automata (Operations.java); maybe you can start from that but
>> cutover to the FST APIs instead for the 2nd automaton?
>
>
> I looked a bit into this. This is complicated stuff :/

Sorry, yes it is.  If you have any ideas to simplify the APIs that
would be awesome :)

> I think I get what the nested loops in intersect() do  : transitions consist
> of a double dimension array, somehow those arays are intersected.
> I don't understand yet why is there a .min and a .max for a transition (why
> not just a codepoint ?)

Most automaton transitions cover a wide range of unicode characters,
so requiring a separate transition for each would be too costly (too
many objects / RAM).

> Fst and automaton (and maybee the lucene codec stuff) are 3 different
> implementations
> of finite state machine/transducers, right ?

I think we have only 2 implementations (FST, Automaton).

> How does regexQuery (automaton) match against an index ?
> Does it use intersect() internally ?  (if it does, maybee I could reuse that
> code too)

RegexpQuery in core (NOT to be confused with the much slower,
differing in name by only one letter, RegexQuery in sandbox) builds an
Automaton and then uses Terms.intersect API.

However, I would not look for inspiration from Terms.intersect: that
implementation (in block tree terms dict) works with the terms
dictionary data structures to perform a fast intersection and that
code is crazy complex.

Possibly a place to look for inspiration/poaching is
FSTUtil.intersectPrefixPaths: that intersects an automaton with an
FST.  It's used by the fuzzy suggester...

Mike McCandless

http://blog.mikemccandless.com



>
>
>>
>>> 2) Also, is there a simple/efficient way to find the lowest and the
>>> highest
>>> arcs of a FST that match against  an automaton ?
>>
>> Hmm arcs leaving which state?  The initial state?  You could simply
>> walk all arcs leaving the initial state from the FST and check if the
>> automaton accepts them leaving its initial state (assuming the
>> automaton has no dead states)?
>>
>> Or, if you are already doing an intersection here, just save this
>> information as a side effect since you will have already computed it.
>
>
> thanks for the tips, it helps.
> Olivier
>
>
>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Finding a match for an automaton against a FST

Posted by Olivier Binda <ol...@wanadoo.fr>.
On 01/10/2015 11:00 AM, Michael McCandless wrote:
> On Fri, Jan 9, 2015 at 6:42 AM, Olivier Binda <ol...@wanadoo.fr> wrote:
>> Hello.
>>
>> 1) What is the best way to check if an automaton (from a regex or a string
>> with a wildcard)
>> has at least 1 match against a FST (from a WFSTCompletionLookup) ?
> You need to implement "intersect".  We already have this method for
> two automata (Operations.java); maybe you can start from that but
> cutover to the FST APIs instead for the 2nd automaton?

I looked a bit into this. This is complicated stuff :/

I think I get what the nested loops in intersect() do  : transitions 
consist of a double dimension array, somehow those arays are intersected.
I don't understand yet why is there a .min and a .max for a transition 
(why not just a codepoint ?)

Fst and automaton (and maybee the lucene codec stuff) are 3 different 
implementations
of finite state machine/transducers, right ?

How does regexQuery (automaton) match against an index ?
Does it use intersect() internally ?  (if it does, maybee I could reuse 
that code too)



>
>> 2) Also, is there a simple/efficient way to find the lowest and the highest
>> arcs of a FST that match against  an automaton ?
> Hmm arcs leaving which state?  The initial state?  You could simply
> walk all arcs leaving the initial state from the FST and check if the
> automaton accepts them leaving its initial state (assuming the
> automaton has no dead states)?
>
> Or, if you are already doing an intersection here, just save this
> information as a side effect since you will have already computed it.

thanks for the tips, it helps.
Olivier



> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Finding a match for an automaton against a FST

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Fri, Jan 9, 2015 at 6:42 AM, Olivier Binda <ol...@wanadoo.fr> wrote:
> Hello.
>
> 1) What is the best way to check if an automaton (from a regex or a string
> with a wildcard)
> has at least 1 match against a FST (from a WFSTCompletionLookup) ?

You need to implement "intersect".  We already have this method for
two automata (Operations.java); maybe you can start from that but
cutover to the FST APIs instead for the 2nd automaton?

> 2) Also, is there a simple/efficient way to find the lowest and the highest
> arcs of a FST that match against  an automaton ?

Hmm arcs leaving which state?  The initial state?  You could simply
walk all arcs leaving the initial state from the FST and check if the
automaton accepts them leaving its initial state (assuming the
automaton has no dead states)?

Or, if you are already doing an intersection here, just save this
information as a side effect since you will have already computed it.

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org