You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Haoyu Zhai (Jira)" <ji...@apache.org> on 2021/05/31 21:08:00 UTC

[jira] [Comment Edited] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

    [ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354659#comment-17354659 ] 

Haoyu Zhai edited comment on LUCENE-9983 at 5/31/21, 9:07 PM:
--------------------------------------------------------------

Thanks [~mikemccand] and [~dweiss]. I've opened a PR based on {{IntIntHashMap}}: [https://github.com/apache/lucene/pull/163]

I've applied the test attached in LUCENE-9981 to verify this PR helps. Seems it successfully reduce the time it need before throwing the exception from 6 min to 16 sec on my local machine (they both stoped at the same point as well).

I still kept the state array to be sorted when get it, so we'll be slower when actually getting array but way faster on putting/removing keys. I'm not quite sure why the speed up is this much, but my guess is we're doing way more operations and spending way more times on increasing/decreasing state count and putting/removing states from the set than introducing new states?


was (Author: zhai7631):
Thanks [~mikemccand] and [~dweiss]. I've opened a PR based on {{IntIntHashMap}}: [https://github.com/apache/lucene/pull/162]

I've applied the test attached in LUCENE-9981 to verify this PR helps. Seems it successfully reduce the time it need before throwing the exception from 6 min to 16 sec on my local machine (they both stoped at the same point as well).

I still kept the state array to be sorted when get it, so we'll be slower when actually getting array but way faster on putting/removing keys. I'm not quite sure why the speed up is this much, but my guess is we're doing way more operations and spending way more times on increasing/decreasing state count and putting/removing states from the set than introducing new states?

> Stop sorting determinize powersets unnecessarily
> ------------------------------------------------
>
>                 Key: LUCENE-9983
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9983
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Priority: Major
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all subsets of NFA states that "belong" in the same determinized state, using [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep these growing maps of int key, int value sorted by key, e.g. upgrading to a {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the ability to add/delete keys from the map, and hashCode / equals (by key only – ignoring value!), and to freeze the map (a small optimization that we could skip initially).  We only use these maps to lookup in the (growing) determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from [HPPC|https://github.com/carrotsearch/hppc]?  And then change its {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org