Posted to dev@lucene.apache.org by "Olivier Favre (JIRA)" <ji...@apache.org> on 2011/08/22 16:12:29 UTC
[jira] [Created] (LUCENE-3392) Combining analyzers output
Combining analyzers output
--------------------------
Key: LUCENE-3392
URL: https://issues.apache.org/jira/browse/LUCENE-3392
Project: Lucene - Java
Issue Type: New Feature
Reporter: Olivier Favre
Priority: Minor
Fix For: 3.4
It should be easy to combine the output of multiple Analyzers, or TokenStreams.
A ComboAnalyzer and a ComboTokenStream class would take multiple instances and multiplex their output, keeping the tokens in a rough order: by increasing position, then increasing start offset, then increasing end offset.
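To make the intended ordering concrete, here is a minimal sketch of the multiplexing step. It is not the attached patch and does not use Lucene's TokenStream API: `Tok` and `ComboMergeSketch` are hypothetical names, each source is assumed to be already sorted, and positions are assumed to be absolute rather than increments.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a token; not Lucene's attribute-based API. */
final class Tok {
  final String term;
  final int position;     // absolute position (assumption: increments already summed)
  final int startOffset;
  final int endOffset;

  Tok(String term, int position, int startOffset, int endOffset) {
    this.term = term;
    this.position = position;
    this.startOffset = startOffset;
    this.endOffset = endOffset;
  }
}

final class ComboMergeSketch {
  /** Increasing position, then start offset, then end offset. */
  static final Comparator<Tok> ORDER =
      Comparator.comparingInt((Tok t) -> t.position)
          .thenComparingInt(t -> t.startOffset)
          .thenComparingInt(t -> t.endOffset);

  /** The next pending token of one source, plus the rest of that source. */
  private static final class Head {
    final Tok tok;
    final Iterator<Tok> rest;

    Head(Tok tok, Iterator<Tok> rest) {
      this.tok = tok;
      this.rest = rest;
    }
  }

  /** Merges several individually sorted token lists into one ordered list. */
  static List<Tok> merge(List<List<Tok>> sources) {
    PriorityQueue<Head> heads =
        new PriorityQueue<>((a, b) -> ORDER.compare(a.tok, b.tok));
    for (List<Tok> source : sources) {
      Iterator<Tok> it = source.iterator();
      if (it.hasNext()) {
        heads.add(new Head(it.next(), it));
      }
    }
    List<Tok> out = new ArrayList<>();
    while (!heads.isEmpty()) {
      Head h = heads.poll();          // smallest pending token across all sources
      out.add(h.tok);
      if (h.rest.hasNext()) {
        heads.add(new Head(h.rest.next(), h.rest));
      }
    }
    return out;
  }
}
```

The order is only "rough" in practice because a real analyzer may emit tokens that are not strictly sorted by these three keys; the sketch sidesteps that by assuming sorted inputs.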
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[jira] [Commented] (LUCENE-3392) Combining analyzers output
Posted by "Olivier Favre (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089415#comment-13089415 ]
Olivier Favre commented on LUCENE-3392:
---------------------------------------
The proposed implementation may be tightly coupled to the JVM's implementation of some classes (StringReader, BufferedReader and FilterReader), as it relies on named private fields (respectively "str", "in" and "in").
This can be avoided, but any Reader would then have to be fully read and stored as a String or a char[], which can add a huge overhead.
Since each clone would be read at roughly the same speed (at least for word-delimiting analysis, though not for a KeywordAnalyzer), an implementation could retain in memory only the portion that has been read by at least one cloned reader but not yet by all of them, in order to implement a "multi read head" reader.
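As a rough illustration of the "multi read head" idea, the sketch below keeps a sliding window over a single source Reader and drops the prefix once every head has consumed it. It is not taken from the patch: `MultiHeadSource`, `head(int)` and `bufferedChars()` are hypothetical names, and the class is not thread-safe.

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical sketch: several read heads sharing one underlying Reader. */
final class MultiHeadSource {
  private final Reader source;
  private final StringBuilder window = new StringBuilder();
  private long windowStart = 0;      // absolute index of window.charAt(0)
  private final long[] headPos;      // absolute read position of each head
  private boolean eof = false;

  MultiHeadSource(Reader source, int heads) {
    this.source = source;
    this.headPos = new long[heads];
  }

  /** Returns the i-th head; closing a head does not close the source. */
  Reader head(final int i) {
    return new Reader() {
      @Override
      public int read(char[] buf, int off, int len) throws IOException {
        return readFor(i, buf, off, len);
      }

      @Override
      public void close() {
        // The caller owns the underlying source.
      }
    };
  }

  /** Number of chars currently retained in memory. */
  int bufferedChars() {
    return window.length();
  }

  private int readFor(int i, char[] buf, int off, int len) throws IOException {
    if (len == 0) {
      return 0;
    }
    long windowEnd = windowStart + window.length();
    if (headPos[i] == windowEnd && !eof) {
      // This head is the furthest ahead: pull more data from the source.
      char[] chunk = new char[len];
      int n = source.read(chunk, 0, len);
      if (n < 0) {
        eof = true;
      } else {
        window.append(chunk, 0, n);
      }
      windowEnd = windowStart + window.length();
    }
    int available = (int) (windowEnd - headPos[i]);
    if (available == 0) {
      return eof ? -1 : 0;
    }
    int n = Math.min(len, available);
    int from = (int) (headPos[i] - windowStart);
    window.getChars(from, from + n, buf, off);
    headPos[i] += n;
    trim();
    return n;
  }

  /** Drops the prefix that every head has already consumed. */
  private void trim() {
    long min = headPos[0];
    for (long p : headPos) {
      min = Math.min(min, p);
    }
    int drop = (int) (min - windowStart);
    if (drop > 0) {
      window.delete(0, drop);
      windowStart = min;
    }
  }
}
```

Memory use is bounded by the distance between the slowest and the fastest head, so a head that reads everything at once while another has not started (the KeywordAnalyzer case) would still buffer the whole content.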
Another option would be to change the API so that the tokenStream and reusableTokenStream functions take a CloneableReader interface with a "giveAClone()" function instead of a Reader.
But this involves massive refactoring (>13,000 lines) and introduces a significant API break.
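For reference, the CloneableReader alternative could look roughly like the sketch below. Only the interface name and "giveAClone()" come from the comment above; `StringCloneableReader` and `copyOf` are hypothetical, and nothing here is part of Lucene or of the attached patch. The `copyOf` helper also shows the "fully read and store as a String" fallback and the memory cost it implies.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical interface named after the proposal; not part of Lucene. */
interface CloneableReader {
  /** Returns a fresh Reader positioned at the start of the content. */
  Reader giveAClone();
}

/** Backs every clone with one immutable String, so each clone is cheap. */
final class StringCloneableReader implements CloneableReader {
  private final String content;

  StringCloneableReader(String content) {
    this.content = content;
  }

  /** Fallback for arbitrary Readers: drain once, then clone from memory. */
  static StringCloneableReader copyOf(Reader reader) throws IOException {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[4096];
    int n;
    while ((n = reader.read(buf)) != -1) {
      sb.append(buf, 0, n);
    }
    return new StringCloneableReader(sb.toString());
  }

  @Override
  public Reader giveAClone() {
    return new StringReader(content);
  }
}
```

Each analyzer would then call giveAClone() to get its own independent Reader, at the price of holding the full content in memory when the source is not already a String.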
The proposed implementation is the best solution I found.
Any suggestions are welcome!
> Combining analyzers output
> --------------------------
>
> Key: LUCENE-3392
> URL: https://issues.apache.org/jira/browse/LUCENE-3392
> Project: Lucene - Java
> Issue Type: New Feature
> Reporter: Olivier Favre
> Priority: Minor
> Labels: analysis
> Fix For: 3.4
>
> Attachments: ComboAnalyzer-lucene3x.patch
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> It should be easy to combine the output of multiple Analyzers, or TokenStreams.
> A ComboAnalyzer and a ComboTokenStream class would take multiple instances, and multiplex their output, keeping a rough order of tokens like increasing position then increasing start offset then increasing end offset.
[jira] [Updated] (LUCENE-3392) Combining analyzers output
Posted by "Olivier Favre (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olivier Favre updated LUCENE-3392:
----------------------------------
Component/s: modules/analysis
[jira] [Updated] (LUCENE-3392) Combining analyzers output
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-3392:
---------------------------------------
Fix Version/s: 3.5 (was: 3.4)
[jira] [Updated] (LUCENE-3392) Combining analyzers output
Posted by "Olivier Favre (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olivier Favre updated LUCENE-3392:
----------------------------------
Attachment: ComboAnalyzer-lucene3x.patch
Patch for lucene-3x.
Tested with Sun's Java 1.6.0_26-b03.
Uses a special factory for cloning Readers; some implementations use reflection to gain access to private fields in order to reduce the need to read and copy a Reader's content.
[jira] [Updated] (LUCENE-3392) Combining analyzers output
Posted by "Olivier Favre (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olivier Favre updated LUCENE-3392:
----------------------------------
Attachment: ComboAnalyzer-lucene3x.patch
Moved analysis-related changes into contrib/analyzers/common, as in the patch for the trunk.
Small changes:
- 2-space indentation (was 4, my personal default)
- removed a few useless imports
- simplified and fixed ComboTokenStream, as I saw that some functions have become final in the trunk.
[jira] [Updated] (LUCENE-3392) Combining analyzers output
Posted by "Olivier Favre (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olivier Favre updated LUCENE-3392:
----------------------------------
Fix Version/s: 4.0
[jira] [Updated] (LUCENE-3392) Combining analyzers output
Posted by "Olivier Favre (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olivier Favre updated LUCENE-3392:
----------------------------------
Attachment: ComboAnalyzer-lucene-trunk.patch
Patch for lucene-trunk.
Tested with Sun's Java 1.6.0_26-b03.
Adds support for Reader cloning in Lucene's core, and puts the analysis code in modules/analysis/common.