You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2014/05/10 23:56:41 UTC

[jira] [Commented] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST

    [ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993701#comment-13993701 ] 

Karl Wright commented on LUCENE-5584:
-------------------------------------

Mike McCandless wonders if, instead of explicit outputs, we create a custom operation class, and override the "add" operator to just modify the value in place.  Apparently Lucene only cares about the add operation for the kinds of things we are doing, and we can likely "fake it out" by this method.  This would have nearly the same effect, he thinks, provided he actually understands what we are trying to do.

I'm hoping Mike will comment also...

> Allow FST read method to also recycle the output value when traversing FST
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-5584
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5584
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>    Affects Versions: 4.7.1
>            Reporter: Christian Ziech
>         Attachments: fst-itersect-benchmark.tgz
>
>
> The FST class heavily reuses Arc instances when traversing the FST. The output of an Arc however is not reused. This can especially be important when traversing large portions of a FST and using the ByteSequenceOutputs and CharSequenceOutputs. Those classes create a new byte[] or char[] for every node read (which has an output).
> In our use case we intersect a lucene Automaton with a FST<BytesRef> much like it is done in org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths() and since the Automaton and the FST are both rather large tens or even hundreds of thousands of temporary byte array objects are created.
> One possible solution to the problem would be to change the org.apache.lucene.util.fst.Outputs class to have two additional methods (if you don't want to change the existing methods for compatibility):
> {code}
>   /** Decode an output value previously written with {@link
>    *  #write(Object, DataOutput)} reusing the object passed in if possible */
>   public abstract T read(DataInput in, T reuse) throws IOException;
>   /** Decode an output value previously written with {@link
>    *  #writeFinalOutput(Object, DataOutput)}.  By default this
>    *  just calls {@link #read(DataInput)}. This tries to  reuse the object   
>    *  passed in if possible */
>   public T readFinalOutput(DataInput in, T reuse) throws IOException {
>     return read(in, reuse);
>   }
> {code}
> The new methods could then be used in the FST in the readNextRealArc() method passing in the output of the reused Arc. For most inputs they could even just invoke the original read(in) method.
> If you should decide to make that change I'd be happy to supply a patch and/or tests for the feature.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org