You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Steven Rowe (JIRA)" <ji...@apache.org> on 2010/04/18 17:39:26 UTC

[jira] Updated: (LUCENE-2400) ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute

     [ https://issues.apache.org/jira/browse/LUCENE-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2400:
--------------------------------

    Attachment: LUCENE-2400.patch

Patch implementing the above-described changes, along with tests confirming that all-filler shingles/unigrams are no longer output.  A new term attribute called FillerAttribute is defined to mark whether enqueued terms are filler terms.

Unfortunately, these changes cause a roughly 22% slowdown - contrib/benchmark numbers for the shingle alg (I got similar numbers for Java 1.5):

JAVA:
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)

OS:
cygwin
WinVistaService Pack 2
Service Pack 26060022202561

||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement||
|2|no|3.04s|3.33s|2.05s|-22.6%|
|2|yes|3.23s|3.49s|2.05s|-18.0%|
|4|no|4.00s|4.56s|2.05s|-22.2%|
|4|yes|4.13s|4.72s|2.05s|-22.0%|


> ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2400
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2400
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 3.0.1
>            Reporter: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2400.patch
>
>
> When the input token stream to ShingleFilter has position increments greater than one, filler tokens are inserted for each position for which there is no token in the input token stream.  As a result, unigrams (if configured) and shingles can be filler-only.  Filler-only output tokens make no sense - these should be removed.
> Also, because TermAttribute has been deprecated in favor of CharTermAttribute, the patch will also convert TermAttribute usages to CharTermAttribute in ShingleFilter.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org