Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2016/11/16 14:04:58 UTC

[jira] [Commented] (LUCENE-2450) Explore write-once attr bindings in the analysis chain

    [ https://issues.apache.org/jira/browse/LUCENE-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670481#comment-15670481 ] 

David Smiley commented on LUCENE-2450:
--------------------------------------

I really like the ideas here!  It would make capture/restore cheaper.  Some filters, like WordDelimiterFilter, don't use capture/restore (I think in the name of efficiency), but as a result they only know about some built-in attributes, not custom ones people add.  The heavyweight nature of capture/restore is my main beef with the current design.
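
To make the trade-off above concrete, here is a minimal Python sketch. It is not Lucene code: in Lucene, capture/restore is done with AttributeSource.captureState() and restoreState(State) in Java, whereas this sketch models a token's attributes as a plain dict. The names capture_state, restore_state, save_known_only, KNOWN_ATTRS and the "my_custom" attribute are all hypothetical, chosen only for illustration.

```python
import copy

# Hypothetical model: a token's attributes are a dict of name -> value.

def capture_state(attrs):
    """Snapshot every attribute, built-in or custom (generic but costly)."""
    return copy.deepcopy(attrs)

def restore_state(attrs, snapshot):
    """Overwrite the live attributes with a previously captured snapshot."""
    attrs.clear()
    attrs.update(copy.deepcopy(snapshot))

# Assumed set of "built-in" attributes a hand-optimized filter knows about.
KNOWN_ATTRS = ("term", "pos_incr", "offset")

def save_known_only(attrs):
    """Cheaper shortcut: copy only a hard-coded set of attributes."""
    return {name: attrs[name] for name in KNOWN_ATTRS if name in attrs}

token = {"term": "wi-fi", "pos_incr": 1, "offset": (0, 5), "my_custom": "brand"}
full = capture_state(token)        # includes the custom attribute
partial = save_known_only(token)   # silently drops the custom attribute
```

The full snapshot pays to copy everything on every call, while the shortcut is cheaper but loses any attribute it was not written to expect. A write-once binding scheme would, in principle, avoid having to choose between the two.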

> Explore write-once attr bindings in the analysis chain
> ------------------------------------------------------
>
>                 Key: LUCENE-2450
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2450
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Michael McCandless
>              Labels: gsoc2014
>         Attachments: LUCENE-2450.patch, LUCENE-2450.patch, pipeline.py
>
>
> I'd like to propose a new means of tracking attrs through the analysis
> chain, whereby a given stage in the pipeline cannot overwrite attrs
> from stages before it (write once).  It can only write to new attrs
> (possibly w/ the same name) that future stages can see; it can never
> alter the attrs or bindings from the prior stages.
> I coded up a prototype chain in python (I'll attach), showing the
> equivalent of WhitespaceTokenizer -> StopFilter -> SynonymFilter ->
> Indexer.
> Each stage "sees" a frozen namespace of attr bindings as its input;
> these attrs are all read-only from its standpoint.  Then, it writes to
> an "output namespace", which is read/write, eg it can add new attrs,
> remove attrs from its input, change the values of attrs.  If that
> stage doesn't alter a given attr it "passes through", unchanged.
> This would be an enormous change to how attrs are managed... so this
> is very very exploratory at this point.  Once we decouple indexer from
> analysis, creating such an alternate chain should be possible -- it'd
> at least be a good test that we've decoupled enough :)
> I think the idea offers some compelling improvements over the "global
> read/write namespace" (AttrFactory) approach we have today:
>   * Injection filters can be more efficient -- they need not
>     capture/restoreState at all
>   * No more need for the initial tokenizer to "clear all attrs" --
>     each stage becomes responsible for clearing the attrs it "owns"
>   * You can truly stack stages (vs having to make a custom
>     AttrFactory) -- eg you could make a Bocu1 stage which can stack
>     onto any other stage.  It'd look up the CharTermAttr, remove it
>     from its output namespace, and add a BytesRefTermAttr.
>   * Indexer should be more efficient, in that it doesn't need to
>     re-get the attrs on each next() -- it gets them up front, and
>     re-uses them.
> Note that in this model, the indexer itself is just another stage in
> the pipeline, so you could do some wild things like use 2 indexer
> stages (writing to different indexes, or maybe the same index but
> somehow with further processing or something).
> Also, in this approach, the analysis chain is more informed about
> what each stage is allowed to change, up front after the chain is
> created.  EG (say) we will know that only 2 stages write to the term
> attr, and that only 1 writes posIncr/offset attrs, etc.  Not sure
> if/how this helps us... but it's more strongly typed than what we have
> today.
> I think we could use a similar chain for processing a document at the
> field level, ie, different stages could add/remove/change different
> fields in the doc....
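
The write-once model described in the quoted issue can be sketched roughly as follows. This is not the attached pipeline.py prototype and not Lucene API; every name here (Stage, LowerCase, BytesTerm, run_chain, writers_of, the attribute names) is a hypothetical stand-in. Two simplifications are assumed: the "Bocu1 stage" is approximated with UTF-8 encoding, and pass-through is implemented by copying the input dict, which shows the semantics but not the efficiency a real binding-based design would aim for.

```python
from types import MappingProxyType

_REMOVED = object()  # sentinel marking an attr deleted from the output namespace

class Stage:
    """One pipeline stage: reads a frozen input, writes its own output layer."""

    def __init__(self, name, writes):
        self.name = name
        self.writes = frozenset(writes)  # attrs this stage declares up front

    def process(self, inp, out):
        """Override: read from `inp` (read-only), write into `out`."""

    def run(self, inp):
        out = {}
        # The stage only ever sees a read-only view of the prior bindings.
        self.process(MappingProxyType(inp), out)
        undeclared = set(out) - self.writes
        if undeclared:
            raise ValueError(f"{self.name} wrote undeclared attrs: {sorted(undeclared)}")
        merged = dict(inp)  # attrs the stage did not touch pass through unchanged
        for key, value in out.items():
            if value is _REMOVED:
                merged.pop(key, None)
            else:
                merged[key] = value
        return merged

class LowerCase(Stage):
    def __init__(self):
        super().__init__("lowercase", writes={"term"})

    def process(self, inp, out):
        out["term"] = inp["term"].lower()

class BytesTerm(Stage):
    """Stand-in for the 'Bocu1 stage': drops the char term, adds a bytes term."""

    def __init__(self):
        super().__init__("bytes", writes={"term", "bytes_term"})

    def process(self, inp, out):
        out["term"] = _REMOVED
        out["bytes_term"] = inp["term"].encode("utf-8")  # UTF-8, not real BOCU-1

def run_chain(stages, token):
    """Feed each stage's output namespace to the next stage as its input."""
    for stage in stages:
        token = stage.run(token)
    return token

def writers_of(stages, attr):
    """Known before any token flows: which stages may write a given attr."""
    return [s.name for s in stages if attr in s.writes]

chain = [LowerCase(), BytesTerm()]
source = {"term": "Wi-Fi", "pos_incr": 1}
result = run_chain(chain, source)
```

In this sketch the original token dict is never modified, pos_incr passes through both stages untouched, and writers_of(chain, "term") can be answered as soon as the chain is built, which corresponds to the "more strongly typed" point in the description.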



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
