Posted to dev@lucene.apache.org by "Nik Everett (JIRA)" <ji...@apache.org> on 2014/05/28 12:58:03 UTC

[jira] [Commented] (LUCENE-4556) FuzzyTermsEnum creates tons of objects

    [ https://issues.apache.org/jira/browse/LUCENE-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011014#comment-14011014 ] 

Nik Everett commented on LUCENE-4556:
-------------------------------------

I'm having GC trouble and I'm using the DirectCandidateGenerator.  It's obviously kind of hard to tell how much the automata are contributing in production, but when I try it locally, just generating the automata for two or three terms takes about 200KB of memory.  Napkin math (200KB * 250 queries/second) says this makes about 50MB of garbage per second per index.  Obviously it gets worse if you run this in a sharded context where each shard does the generating.  Well, not really worse, but the large up-front cost and memory consumption of this process is relatively static regardless of shard size, so this becomes a reason to use larger shards.
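
To make the napkin math concrete, something along these lines reproduces the kind of local measurement I mean. The terms and the edit distance are arbitrary placeholders, and the Runtime-based heap delta is only a napkin-level estimate, but it is enough for an order-of-magnitude check of the per-term cost of building the automata:

import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.LevenshteinAutomata;

public class AutomatonGarbageNapkin {
  public static void main(String[] args) {
    // Arbitrary placeholder terms; real candidates come from the spellchecker.
    String[] terms = {"lucene", "spellchecker", "automaton"};
    Runtime rt = Runtime.getRuntime();

    System.gc();
    long before = rt.totalMemory() - rt.freeMemory();

    // Build the edit-distance-2 automaton for each term - the same work the
    // fuzzy/spellcheck path does per candidate term.
    Automaton[] keep = new Automaton[terms.length];
    for (int i = 0; i < terms.length; i++) {
      keep[i] = new LevenshteinAutomata(terms[i], true).toAutomaton(2);
    }

    long after = rt.totalMemory() - rt.freeMemory();
    System.out.printf("~%d KB allocated building automata for %d terms%n",
        (after - before) / 1024, keep.length);
  }
}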

I should add that, in addition to Simon's patches, another option is to try to implement something like the stack-based automaton simulation that the Schulz-Mihov paper (the one that proposed the Levenshtein automaton) describes in section 6.  It's not useful for stuff like intersecting the enums, but if you are willing to forgo that you could probably get away with much less memory consumption.
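
To illustrate the general idea (simulating the automaton's acceptance check lazily instead of materializing states and transitions up front), here is a deliberately dumb sketch: a plain banded Levenshtein check that answers "is this candidate within distance n of the query?" one candidate at a time, allocating only two small int arrays per query.  This is not the Schulz-Mihov construction itself, it ignores transpositions, and as noted you give up intersecting with the terms enum, but it shows how little memory the per-candidate check actually needs:

// Sketch only: per-candidate edit-distance check with no automaton objects.
public class LazyFuzzyCheck {

  public static boolean withinDistance(String query, String candidate, int maxDist) {
    int qLen = query.length(), cLen = candidate.length();
    if (Math.abs(qLen - cLen) > maxDist) {
      return false;
    }
    // Classic two-row dynamic programming; prev/cur are the only allocations.
    int[] prev = new int[cLen + 1];
    int[] cur = new int[cLen + 1];
    for (int j = 0; j <= cLen; j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= qLen; i++) {
      cur[0] = i;
      int rowMin = cur[0];
      for (int j = 1; j <= cLen; j++) {
        int cost = query.charAt(i - 1) == candidate.charAt(j - 1) ? 0 : 1;
        cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
        rowMin = Math.min(rowMin, cur[j]);
      }
      // Once every cell in a row exceeds maxDist the distance can't recover.
      if (rowMin > maxDist) {
        return false;
      }
      int[] tmp = prev; prev = cur; cur = tmp;
    }
    return prev[cLen] <= maxDist;
  }

  public static void main(String[] args) {
    System.out.println(withinDistance("lucene", "lucine", 2)); // true
    System.out.println(withinDistance("lucene", "solr", 2));   // false
  }
}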

> FuzzyTermsEnum creates tons of objects
> --------------------------------------
>
>                 Key: LUCENE-4556
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4556
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search, modules/spellchecker
>    Affects Versions: 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael McCandless
>            Priority: Critical
>             Fix For: 4.9, 5.0
>
>         Attachments: LUCENE-4556.patch, LUCENE-4556.patch
>
>
> I ran into this problem in production using the DirectSpellchecker. The number of objects created by the spellchecker shoots through the roof very, very quickly. We ran about 130 queries and ended up with > 2M transitions / states. We spent 50% of the time in GC just because of transitions. Other parts of the system behave just fine here.
> I talked quickly to robert and gave a POC a shot, providing a LevenshteinAutomaton#toRunAutomaton(prefix, n) method to optimize this case and build an array-based structure converted into UTF-8 directly instead of going through the object-based APIs. This involved quite a few changes, but they are all package private at this point. I have a patch that still has a fair set of nocommits, but it shows that it's possible and IMO worth the trouble to make this really usable in production. All tests pass with the patch - it's a start....



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org