Posted to dev@lucene.apache.org by "Steven Rowe (JIRA)" <ji...@apache.org> on 2010/05/19 19:26:53 UTC

[jira] Created: (LUCENE-2470) Add conditional branching/merging to Lucene's analysis pipeline

Add conditional branching/merging to Lucene's analysis pipeline
---------------------------------------------------------------

Key: LUCENE-2470
URL: https://issues.apache.org/jira/browse/LUCENE-2470
Project: Lucene - Java
Issue Type: New Feature
Components: Analysis
Affects Versions: 4.0
Reporter: Steven Rowe
Priority: Minor

Captured from a #lucene brainstorming session with Robert Muir:

Lucene's analysis pipeline would be more flexible if it were possible to apply filter(s) to only part of an input stream's tokens, under user-specifiable conditions (e.g. when a given token attribute has a particular value) in a way that did not place that responsibility on individual filters.

Two use cases:

# StandardAnalyzer could directly handle ideographic characters in the same way as CJKTokenizer, which generates bigrams, if it could call ShingleFilter only when TypeAttribute=<CJK>, or when Robert's new ScriptAttribute=<Ideographic>.
# Stemming might make sense for some stemmer/domain combinations only when token length exceeds some threshold. For example, a user could configure an analyzer to stem only when CharTermAttribute length is greater than 4 characters.

One potential way to achieve this conditional branching facility is with a new kind of filter that can be configured with one or more following filters and condition(s) under which the filter should be engaged. This could be called BranchingFilter.

I think a MergingFilter, the inverse of BranchingFilter, is necessary in the current pipeline architecture, to have a single pipeline endpoint. A MergingFilter might be useful in its own right, e.g. to collect document data from multiple sources. Perhaps a conditional merging facility would be useful as well.
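
As a rough illustration of the second use case, here is a minimal sketch in plain Java, not Lucene code. The names BranchingSketch, branch, and toyStem are hypothetical, tokens are modeled as bare strings rather than attribute sources, and the "stemmer" only strips a trailing "s". It shows a filter being engaged only when a condition holds, with the condition kept outside the filter itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these names are Lucene API.
class BranchingSketch {

    // Applies `filter` only to tokens matching `condition`;
    // every other token passes through unchanged.
    static List<String> branch(List<String> tokens,
                               Predicate<String> condition,
                               UnaryOperator<String> filter) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(condition.test(token) ? filter.apply(token) : token);
        }
        return out;
    }

    // Toy stand-in for a real stemmer: strips one trailing "s".
    static String toyStem(String token) {
        return token.endsWith("s") ? token.substring(0, token.length() - 1) : token;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("gas", "bus", "engines", "runs");
        // Stem only when the term is longer than 4 characters.
        System.out.println(branch(tokens, t -> t.length() > 4, BranchingSketch::toyStem));
        // prints [gas, bus, engine, runs]
    }
}
```

The stemmer knows nothing about the length threshold, which is the separation of responsibility the description asks for.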

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2470) Add conditional branching/merging to Lucene's analysis pipeline

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869317#action_12869317 ] 

Steven Rowe commented on LUCENE-2470:
-------------------------------------

bq. We should also allow for 1 -> many sub-pipelines, eg you conditionally invoke an ngram filter.  Or many -> many, eg you conditionally invoke a shingle filter.

Do you mean that it should be possible to configure multiple filters to process the same input token?  If so, then we could e.g. remove the "outputUnigram" option from ShingleFilter, and configure both a PassThroughFilter and a ShingleFilter to operate simultaneously over the same inputs.
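
A minimal sketch of that idea in plain Java, assuming tokens are bare strings. FanOutSketch, passThrough, bigrams, and fanOut are hypothetical names that only imitate the filters mentioned above, and the outputs are simply concatenated rather than interleaved by position as a real token stream would be.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; these methods only imitate the filters named in the comment.
class FanOutSketch {

    // Stand-in for the proposed PassThroughFilter: emits every token unchanged.
    static List<String> passThrough(List<String> tokens) {
        return new ArrayList<>(tokens);
    }

    // Stand-in for a shingle filter without unigram output: adjacent-token bigrams only.
    static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    // Feeds the same input to every sub-filter and concatenates their outputs.
    // A real implementation would interleave by position; this sketch does not.
    static List<String> fanOut(List<String> tokens,
                               List<Function<List<String>, List<String>>> subs) {
        List<String> out = new ArrayList<>();
        for (Function<List<String>, List<String>> sub : subs) {
            out.addAll(sub.apply(tokens));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Function<List<String>, List<String>>> subs = new ArrayList<>();
        subs.add(FanOutSketch::passThrough);
        subs.add(FanOutSketch::bigrams);
        System.out.println(fanOut(Arrays.asList("please", "divide", "this"), subs));
        // prints [please, divide, this, please divide, divide this]
    }
}
```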




[jira] Commented: (LUCENE-2470) Add conditional branching/merging to Lucene's analysis pipeline

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869322#action_12869322 ] 

Michael McCandless commented on LUCENE-2470:
--------------------------------------------

bq. I think one consequence of this design is that the BranchingFilter/Stage would have to do its own merging, so MergingFilter is not necessary, right?

Right.

bq. The other uses for a MergingFilter should be put into another issue, if we go with this design and there is interest, switching this issue to cover only BranchingFilter/Stage.

These are interesting too!

bq. Do you mean that it should be possible to configure multiple filters to process the same input token?

Actually I didn't -- I meant that we should allow a sub-pipeline to process 1 token and produce (say) 3.  But it is a neat idea to allow more than one sub to operate; I like the PassThroughFilter.

bq. Before I forget: It's always bugged me that analysis output can only be to a single field. Could this be the place to fix that?

That's a biggish change :)  I think we should tackle it separately -- we'd have to change indexer for this (right now it visits one field at a time, processing all of its tokens).

But, I do think this write-once attr approach could be used as a document pre-processing pipeline, eg to enhance the doc, pull out additional fields, etc.


[jira] Commented: (LUCENE-2470) Add conditional branching/merging to Lucene's analysis pipeline

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869217#action_12869217 ] 

Steven Rowe commented on LUCENE-2470:
-------------------------------------

One more thing from #lucene: if a conditionally-applied filter isn't given one or more input stream tokens, it could either be reset(), or it could detect position increment gaps.  Maybe both behaviors should be selectable via configuration?
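
The two behaviors can be compared in a small plain-Java sketch. Everything here is an assumption made for illustration: GapHandlingSketch, BigramFilter, and the Mode enum are hypothetical, tokens are bare strings, and the stateful filter emits bigrams only (no unigrams). In RESET mode the branching code resets the filter whenever a token bypasses it; in POSITION_GAP mode it instead reports the skipped tokens as a position increment greater than 1 and lets the filter decide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the two gap-handling options; not Lucene API.
class GapHandlingSketch {

    // Toy stateful filter: pairs each token with the previous token it saw.
    static class BigramFilter {
        private String previous;

        void reset() {
            previous = null;
        }

        // posInc > 1 means tokens were skipped before this one,
        // so no bigram is formed across the gap.
        List<String> accept(String token, int posInc) {
            List<String> out = new ArrayList<>();
            if (previous != null && posInc == 1) {
                out.add(previous + token);
            }
            previous = token;
            return out;
        }
    }

    enum Mode { RESET, POSITION_GAP }

    static List<String> run(List<String> tokens, Predicate<String> condition, Mode mode) {
        BigramFilter filter = new BigramFilter();
        List<String> out = new ArrayList<>();
        int skipped = 0;
        for (String token : tokens) {
            if (!condition.test(token)) {
                out.add(token);                 // token bypasses the filter
                skipped++;
                if (mode == Mode.RESET) {
                    filter.reset();             // option 1: drop the filter's state
                }
                continue;
            }
            // Option 2: tell the filter how many positions it missed.
            int posInc = (mode == Mode.POSITION_GAP) ? 1 + skipped : 1;
            out.addAll(filter.accept(token, posInc));
            skipped = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("A", "B", "x", "C", "D");
        Predicate<String> isUpper = t -> t.equals(t.toUpperCase());
        System.out.println(run(tokens, isUpper, Mode.RESET));        // prints [AB, x, CD]
        System.out.println(run(tokens, isUpper, Mode.POSITION_GAP)); // prints [AB, x, CD]
    }
}
```

Both modes avoid the spurious bigram "BC" across the skipped token; they differ only in whether the branching code or the filter carries the responsibility.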


[jira] Commented: (LUCENE-2470) Add conditional branching/merging to Lucene's analysis pipeline

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869307#action_12869307 ] 

Steven Rowe commented on LUCENE-2470:
-------------------------------------

bq. I think we should allow the conditional to switch between sub-pipelines? EG I could make a stage that detects proper names (say)... and if the token is not a proper name, it'll run through a LowercaseFilter then StopFilter, else it passes through. So the conditional would switch between full sub-pipelines.

I think one consequence of this design is that the BranchingFilter/Stage would have to do its own merging, so MergingFilter is not necessary, right?

The other uses for a MergingFilter should be put into another issue, if we go with this design and there is interest, switching this issue to cover only BranchingFilter/Stage.




[jira] Commented: (LUCENE-2470) Add conditional branching/merging to Lucene's analysis pipeline

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869297#action_12869297 ] 

Michael McCandless commented on LUCENE-2470:
--------------------------------------------

This is a great idea!

It'd give us much more composability in the analysis pipeline, since
individual filters (Shingle, Stem) would be fully independent, ie not
aware that they are being invoked from the BranchingFilter.

I think we should allow the conditional to switch between
sub-pipelines?  EG I could make a stage that detects proper names
(say)... and if the token is not a proper name, it'll run through a
LowercaseFilter then StopFilter, else it passes through.  So the
conditional would switch between full sub-pipelines.
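
The proper-name example might look roughly like this in plain Java. This is only a sketch under stated assumptions: SubPipelineSwitchSketch and its methods are hypothetical, tokens are bare strings, proper names are detected by lookup in a fixed set, and each sub-pipeline maps one token to a list so that a stop filter can drop it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch; not Lucene API.
class SubPipelineSwitchSketch {

    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    // Sub-pipeline: lowercase, then drop stop words (an empty list means "token removed").
    static List<String> lowercaseThenStop(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        if (!STOP_WORDS.contains(lower)) {
            out.add(lower);
        }
        return out;
    }

    // Routes each token to one of two sub-pipelines and merges the results in input order.
    static List<String> switchBetween(List<String> tokens,
                                      Predicate<String> condition,
                                      Function<String, List<String>> whenTrue,
                                      Function<String, List<String>> whenFalse) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.addAll((condition.test(token) ? whenTrue : whenFalse).apply(token));
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy proper-name detector: membership in a fixed set.
        Set<String> properNames = new HashSet<>(Arrays.asList("Apache"));
        List<String> result = switchBetween(
            Arrays.asList("The", "Apache", "Way"),
            properNames::contains,
            Collections::singletonList,                    // proper names pass through
            SubPipelineSwitchSketch::lowercaseThenStop);   // everything else
        System.out.println(result); // prints [Apache, way]
    }
}
```

Because switchBetween collects both branches into one output list, the branching step also does its own merging in this sketch.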

We should also allow for 1 -> many sub-pipelines, eg you conditionally
invoke an ngram filter.  Or many -> many, eg you conditionally invoke a
shingle filter.
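
For the 1 -> many case, a sketch along these lines may help. It is plain Java with hypothetical names (OneToManySketch, ngrams, branch), models tokens as bare strings, and picks the length threshold and gram size arbitrarily.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch; not Lucene API.
class OneToManySketch {

    // All character n-grams of one token; tokens shorter than n yield nothing.
    static List<String> ngrams(String token, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            out.add(token.substring(i, i + n));
        }
        return out;
    }

    // Tokens matching `condition` are replaced by the sub-pipeline's output,
    // which may contain several tokens; the rest pass through unchanged.
    static List<String> branch(List<String> tokens,
                               Predicate<String> condition,
                               Function<String, List<String>> sub) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (condition.test(token)) {
                out.addAll(sub.apply(token));
            } else {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> result = branch(
            Arrays.asList("db", "solar"),
            t -> t.length() > 4,
            t -> ngrams(t, 3));
        System.out.println(result); // prints [db, sol, ola, lar]
    }
}
```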

I think upgrading the analysis pipeline to write-once attr bindings
(LUCENE-2450) would make this BranchingFilter easier to implement.

With write-once bindings, there's full visibility on which attrs a
Stage writes to (changes).  So this BranchingStage could easily
introspect to see which attrs its subs write to, invoke them as the
conditions require, and if none of the conditions apply, copy over the
necessary attrs.
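
One possible reading of this, sketched in plain Java. The Stage interface, the attribute maps, and every class name below are assumptions made for illustration; they do not describe the LUCENE-2450 design. The point is only that if each stage declares the attributes it writes, the branching stage can learn that set from its sub-stage and copy exactly those attributes when the condition does not apply.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch of a write-once style contract; not the LUCENE-2450 API.
class WriteOnceBranchSketch {

    interface Stage {
        // Attribute names this stage writes, declared up front.
        Set<String> writes();

        // Returns fresh bindings for exactly the attributes named by writes().
        Map<String, Object> process(Map<String, Object> in);
    }

    static class LowercaseStage implements Stage {
        public Set<String> writes() {
            return Collections.singleton("term");
        }

        public Map<String, Object> process(Map<String, Object> in) {
            Map<String, Object> out = new HashMap<>();
            out.put("term", ((String) in.get("term")).toLowerCase(Locale.ROOT));
            return out;
        }
    }

    static class BranchingStage implements Stage {
        private final Predicate<Map<String, Object>> condition;
        private final Stage sub;

        BranchingStage(Predicate<Map<String, Object>> condition, Stage sub) {
            this.condition = condition;
            this.sub = sub;
        }

        // Introspection: the branch writes whatever its sub-stage writes.
        public Set<String> writes() {
            return sub.writes();
        }

        public Map<String, Object> process(Map<String, Object> in) {
            if (condition.test(in)) {
                return sub.process(in);
            }
            // Condition not met: copy the attributes the sub-stage would have written.
            Map<String, Object> out = new HashMap<>();
            for (String name : sub.writes()) {
                out.put(name, in.get(name));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Stage stage = new BranchingStage(
            in -> !"<NAME>".equals(in.get("type")), new LowercaseStage());

        Map<String, Object> word = new HashMap<>();
        word.put("term", "Lucene");
        word.put("type", "<WORD>");
        Map<String, Object> name = new HashMap<>();
        name.put("term", "Lucene");
        name.put("type", "<NAME>");

        System.out.println(stage.process(word).get("term")); // prints lucene
        System.out.println(stage.process(name).get("term")); // prints Lucene
    }
}
```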



[jira] Commented: (LUCENE-2470) Add conditional branching/merging to Lucene's analysis pipeline

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869302#action_12869302 ] 

Robert Muir commented on LUCENE-2470:
-------------------------------------

{quote}
I think we should allow the conditional to switch between
sub-pipelines? EG I could make a stage that detects proper names
(say)... and if the token is not a proper name, it'll run through a
LowercaseFilter then StopFilter, else it passes through. So the
conditional would switch between full sub-pipelines.
{quote}

I really like this aspect of the idea. Besides the language issues that
Steven brought up, we could start to look at the KeywordAttribute/KeywordMarker
as a "hack", and this is a more generalized way to look at it.

I think the real key is, if we can make it nice to do this declaratively,
for example in a Solr schema definition.

This way, someone with a multilanguage document/query could apply
conditional pipelines to different parts, someone could do the 'keyword'
stuff (but this might be based on length, their own custom attribute, POS,
whatever they want).

In truth I think there are a lot of hardcoded 'conditions/parameters' in the
analysis components right now. Something like this would allow pieces to
be more general/reusable and flexible.



[jira] Commented: (LUCENE-2470) Add conditional branching/merging to Lucene's analysis pipeline

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869319#action_12869319 ] 

Steven Rowe commented on LUCENE-2470:
-------------------------------------

Before I forget: It's always bugged me that analysis output can only be to a single field.  Could this be the place to fix that?


[jira] Commented: (LUCENE-2470) Add conditional branching/merging to Lucene's analysis pipeline

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869326#action_12869326 ] 

Steven Rowe commented on LUCENE-2470:
-------------------------------------

bq. I think the real key is, if we can make it nice to do this declaratively, for example in a Solr schema definition.

I agree.

We could start with a BranchingStageFactory that takes in a structured conditional processing specification, but I have a feeling that declarative specification of the entire analysis pipeline, à la Solr, will turn out to be the way to go.
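
To make the factory idea concrete, here is a small plain-Java sketch. The spec format is invented for illustration: a string-to-string argument map with a "condition" of the form "length>N" and a "filter" chosen by name from a fixed registry. BranchingSpecSketch and its methods are hypothetical and are not an existing Lucene or Solr API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical sketch of a declaratively configured branching stage.
class BranchingSpecSketch {

    // Registry of named filters that a spec may refer to (illustrative only).
    static final Map<String, UnaryOperator<String>> FILTERS = new HashMap<>();
    static {
        FILTERS.put("lowercase", t -> t.toLowerCase(Locale.ROOT));
        FILTERS.put("uppercase", t -> t.toUpperCase(Locale.ROOT));
    }

    // Parses the only condition form this sketch supports: "length>N".
    static Predicate<String> parseCondition(String spec) {
        String prefix = "length>";
        if (spec != null && spec.startsWith(prefix)) {
            int threshold = Integer.parseInt(spec.substring(prefix.length()));
            return t -> t.length() > threshold;
        }
        throw new IllegalArgumentException("unsupported condition: " + spec);
    }

    // Builds a per-token branching function from string arguments.
    static UnaryOperator<String> build(Map<String, String> args) {
        Predicate<String> condition = parseCondition(args.get("condition"));
        UnaryOperator<String> filter = FILTERS.get(args.get("filter"));
        if (filter == null) {
            throw new IllegalArgumentException("unknown filter: " + args.get("filter"));
        }
        return token -> condition.test(token) ? filter.apply(token) : token;
    }

    public static void main(String[] args) {
        Map<String, String> spec = new HashMap<>();
        spec.put("condition", "length>4");
        spec.put("filter", "lowercase");
        UnaryOperator<String> stage = build(spec);
        System.out.println(stage.apply("LUCENE")); // prints lucene
        System.out.println(stage.apply("Solr"));   // prints Solr
    }
}
```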


