You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Haoyu Zhai (Jira)" <ji...@apache.org> on 2021/07/15 06:01:00 UTC

[jira] [Commented] (LUCENE-10010) Should we have a NFA Query?

    [ https://issues.apache.org/jira/browse/LUCENE-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381057#comment-17381057 ] 

Haoyu Zhai commented on LUCENE-10010:
-------------------------------------

??With an NFA, we'd be forced to match every term in the term dictionary? Or what I missing something here??
 As far as I understand (which might be wrong), the way we're currently use DFA to intersect with term dictionary is to provide a initial term (which might be null), then based on that term find the next acceptable term in lexicographic order. I think this can still be done using an NFA? What in my mind is like doing a partial determinize process by always taking the smallest unvisited transition until an accept state is reached.

I think there're mainly two benefits we can get if we have this new kind of query:
 # A possibly better performance when query are not reusable: what we do today is we determinize upfront and use DFA at search time, so if the determinized query could be reused, then the determinize cost could be amortized to nearly zero. But if on the opposite, then we have to pay the whole determinization cost every time. While in the NFA query, ideally we don't need to determinize the whole NFA every time, so it could be faster than the DFA query. An extreme case is that for an empty index, DFA query still need to determinize while NFA query don't need it at all.
 # As [~mikemccand] mentioned above, we could be more resilient against ReDoS attack

I can try to get some code work done and benchmark to see whether point 1 holds.
  

 

> Should we have a NFA Query?
> ---------------------------
>
>                 Key: LUCENE-10010
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10010
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/search
>    Affects Versions: main (9.0)
>            Reporter: Haoyu Zhai
>            Priority: Major
>
> Today when a {{RegexpQuery}} is created, it will be translated to NFA, determinized to DFA and eventually become an {{AutomatonQuery}}, which is very fast. However, not every NFA could be determinized to DFA easily, the example given in LUCENE-9981 showed how easy could a short regexp break the determinize process.
> Maybe, instead of marking those kind of queries as adversarial cases, we could make a new kind of NFA query, which execute directly on NFA and thus no need to worry about determinize process or determinized DFA size. It should be slower, but also makes those adversarial cases doable.
> [This article|https://swtch.com/~rsc/regexp/regexp1.html] has provided a simple but efficient way of searching over NFA, essentially it is a partial determinize process that only determinize the necessary part of DFA. Maybe we could give it a try?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org