You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Haoyu Zhai (Jira)" <ji...@apache.org> on 2021/09/07 17:42:00 UTC

[jira] [Commented] (LUCENE-10010) Should we have a NFA Query?

    [ https://issues.apache.org/jira/browse/LUCENE-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411388#comment-17411388 ] 

Haoyu Zhai commented on LUCENE-10010:
-------------------------------------

I've done a very basic benchmark using the current "Wildcard" task based on current PR revision.

On wiki10k index, I see a ~6% qps improvement (295 vs 313).

On wikiall index, I see a ~50% qps degradation (40 vs 20).

Also on wikiall index, the JFR helps me identified that the {{getCharClass}} is the biggest hotspot (we have optimized that in DFA cases by using a 256 length array to map the char to char class, in NFA case I haven't added that optimization yet and we do a binary search each time a char coming), I'll try to optimize the current PR and see what number we can get at the end. And also I'll create a task using more complex regex, current Wildcard task is too simple.

> Should we have a NFA Query?
> ---------------------------
>
>                 Key: LUCENE-10010
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10010
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/search
>    Affects Versions: main (9.0)
>            Reporter: Haoyu Zhai
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Today when a {{RegexpQuery}} is created, it will be translated to NFA, determinized to DFA and eventually become an {{AutomatonQuery}}, which is very fast. However, not every NFA could be determinized to DFA easily, the example given in LUCENE-9981 showed how easy could a short regexp break the determinize process.
> Maybe, instead of marking those kind of queries as adversarial cases, we could make a new kind of NFA query, which execute directly on NFA and thus no need to worry about determinize process or determinized DFA size. It should be slower, but also makes those adversarial cases doable.
> [This article|https://swtch.com/~rsc/regexp/regexp1.html] has provided a simple but efficient way of searching over NFA, essentially it is a partial determinize process that only determinize the necessary part of DFA. Maybe we could give it a try?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org