Posted to dev@lucene.apache.org by "Steven Rowe (JIRA)" <ji...@apache.org> on 2010/04/18 17:30:25 UTC

[jira] Created: (LUCENE-2400) ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute

ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute
---------------------------------------------------------------------------------------------------------------

                 Key: LUCENE-2400
                 URL: https://issues.apache.org/jira/browse/LUCENE-2400
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/analyzers
    Affects Versions: 3.0.1
            Reporter: Steven Rowe
            Priority: Minor


When the input token stream to ShingleFilter has position increments greater than one, filler tokens are inserted for each position for which there is no token in the input token stream.  As a result, unigrams (if configured) and shingles can be filler-only.  Filler-only output tokens make no sense - these should be removed.

Also, because TermAttribute has been deprecated in favor of CharTermAttribute, the patch will also convert TermAttribute usages to CharTermAttribute in ShingleFilter.
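
A minimal sketch of the requested behaviour, in plain Java without any Lucene classes. It is not the attached patch: the names `ShingleFillerSketch`, `Tok` and `FILLER`, and the use of "_" as the filler term, are assumptions made for illustration. Gaps in position increments become filler slots, and any shingle or unigram made up only of fillers is skipped.

```java
import java.util.ArrayList;
import java.util.List;

final class ShingleFillerSketch {
  static final String FILLER = "_"; // stand-in filler term, chosen for illustration

  /** A term plus its position increment relative to the previous token. */
  static final class Tok {
    final String term;
    final int posInc;
    Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
  }

  /** Expands position gaps into filler slots: one filler per skipped position. */
  static List<String> expand(List<Tok> input) {
    List<String> slots = new ArrayList<String>();
    for (Tok t : input) {
      for (int i = 1; i < t.posInc; i++) slots.add(FILLER);
      slots.add(t.term);
    }
    return slots;
  }

  /** Builds grams of each size at each start slot, skipping grams made only of fillers. */
  static List<String> shingles(List<Tok> input, int maxSize, boolean unigrams) {
    List<String> slots = expand(input);
    List<String> out = new ArrayList<String>();
    int minSize = unigrams ? 1 : 2;
    for (int start = 0; start < slots.size(); start++) {
      for (int size = minSize; size <= maxSize && start + size <= slots.size(); size++) {
        StringBuilder gram = new StringBuilder();
        boolean allFiller = true;
        for (int i = start; i < start + size; i++) {
          if (i > start) gram.append(' ');
          gram.append(slots.get(i));
          if (!FILLER.equals(slots.get(i))) allFiller = false;
        }
        if (!allFiller) out.add(gram.toString()); // drop "_" and "_ _"
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Tok> input = new ArrayList<Tok>();
    input.add(new Tok("please", 1));
    input.add(new Tok("divide", 3)); // two positions skipped, e.g. by a stop filter
    System.out.println(shingles(input, 2, true));
  }
}
```

With the input above, the slots are "please", "_", "_", "divide". The grams "_" and "_ _" are dropped, while mixed grams such as "please _" and "_ divide" are kept.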

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2400) ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858326#action_12858326 ] 

Steven Rowe commented on LUCENE-2400:
-------------------------------------

I tried adding specialized versions of CharTermAttribute.append(StringBuilder,...):


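{code:java}
// Convenience overload: append the builder's entire contents.
public CharTermAttribute append(StringBuilder builder) {
  return append(builder, 0, builder.length());
}

// Copy builder[start, end) straight into the term buffer, with no intermediate String.
public CharTermAttribute append(StringBuilder builder, int start, int end) {
  int newTermLength = termLength + end - start;
  resizeBuffer(newTermLength);                           // make sure termBuffer has room
  builder.getChars(start, end, termBuffer, termLength);  // write after the current term content
  termLength = newTermLength;
  return this;
}
{code}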

This helped a little bit, but it's still slower than the fully-spelled-out CharTermAttribute setting code that was previously in place:

JAVA:
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)

OS:
cygwin
WinVistaService Pack 2
Service Pack 26060022202561

||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement||
|2|no|3.08s|3.26s|2.11s|-15.6%|
|2|yes|3.26s|3.41s|2.11s|-11.4%|
|4|no|4.05s|4.49s|2.11s|-18.4%|
|4|yes|4.17s|4.64s|2.11s|-18.5%|



[jira] Issue Comment Edited: (LUCENE-2400) ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858344#action_12858344 ] 

Uwe Schindler edited comment on LUCENE-2400 at 4/18/10 5:28 PM:
----------------------------------------------------------------

bq. I tried adding specialized versions of CharTermAttribute.append(StringBuilder,...): 

Did you also add this to the interface? Otherwise your code would not use this method. LUCENE-2401 does not have the start, end methods, as this is not even in StringBuilder.

      was (Author: thetaphi):
    bq. I tried adding specialized versions of CharTermAttribute.append(StringBuilder,...): 

Did you also add this to the interface, else your code would not use this method. LUCENE-1401 does not have the start,end methods, as this is not even in StringBuilder.
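
To illustrate the point about the interface: Java picks an overload at compile time from the declared type of the receiver. A specialised overload that exists only on the implementation class is therefore never selected when the call goes through the interface type. The sketch below uses hypothetical stand-in types (`TermAtt`, `TermAttImpl`), not Lucene's own classes.

```java
final class OverloadDispatchSketch {
  /** Stand-in for an attribute interface that declares only the general overload. */
  interface TermAtt {
    String append(CharSequence csq);
  }

  /** Stand-in implementation adding a specialised overload that the interface lacks. */
  static final class TermAttImpl implements TermAtt {
    public String append(CharSequence csq) { return "append(CharSequence)"; }
    public String append(StringBuilder sb) { return "append(StringBuilder)"; }
  }

  public static void main(String[] args) {
    StringBuilder gram = new StringBuilder("a b");
    TermAtt viaInterface = new TermAttImpl();
    TermAttImpl viaImpl = new TermAttImpl();
    // Only append(CharSequence) is visible through the interface type.
    System.out.println(viaInterface.append(gram)); // append(CharSequence)
    // Through the implementation type the more specific overload is chosen.
    System.out.println(viaImpl.append(gram));      // append(StringBuilder)
  }
}
```

Declaring the specialised method on the interface as well makes it visible to callers that hold only the interface type.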
  

[jira] Updated: (LUCENE-2400) ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2400:
--------------------------------

    Attachment: LUCENE-2400.patch

This patch implements Uwe's suggestion (on #lucene-dev) of switching term attribute setting to use the simpler termAtt.append(gramBuilder).  However, this seems to slow things down:

JAVA:
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)

OS:
cygwin
WinVistaService Pack 2
Service Pack 26060022202561

||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement||
|2|no|3.12s|3.36s|2.09s|-18.8%|
|2|yes|3.28s|3.54s|2.09s|-17.8%|
|4|no|4.00s|4.61s|2.09s|-24.1%|
|4|yes|4.14s|4.72s|2.09s|-22.0%|



[jira] Commented: (LUCENE-2400) ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858344#action_12858344 ] 

Uwe Schindler commented on LUCENE-2400:
---------------------------------------

bq. I tried adding specialized versions of CharTermAttribute.append(StringBuilder,...): 

Did you also add this to the interface? Otherwise your code would not use this method. LUCENE-2401 does not have the start, end methods, as this is not even in StringBuilder.


[jira] Commented: (LUCENE-2400) ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858307#action_12858307 ] 

Uwe Schindler commented on LUCENE-2400:
---------------------------------------

Alternatively you can directly pass it to StringBuilder, as CharTermAttribute implements CharSequence: [http://java.sun.com/j2se/1.5.0/docs/api/java/lang/StringBuilder.html#append(java.lang.CharSequence)]. But copying the buffer as Robert suggested should be faster.

Both variants are faster than creating a new String.
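
The three variants discussed here can be put side by side in a small sketch. `Term` is a hypothetical stand-in for a term attribute (a char buffer exposed as a `CharSequence`), not the Lucene class. Only the `StringBuilder` calls are real API, and the sketch makes no timing claims.

```java
final class AppendVariantsSketch {
  /** Hypothetical stand-in for a term attribute backed by a char buffer. */
  static final class Term implements CharSequence {
    private final char[] buf;
    private final int len;
    Term(String s) { buf = s.toCharArray(); len = buf.length; }
    char[] buffer() { return buf; }
    public int length() { return len; }
    public char charAt(int i) { return buf[i]; }
    public CharSequence subSequence(int s, int e) { return new String(buf, s, e - s); }
    @Override public String toString() { return new String(buf, 0, len); }
  }

  /** Variant 1: allocates a throwaway String for every append. */
  static String viaToString(Term t) {
    return new StringBuilder().append(t.toString()).toString();
  }

  /** Variant 2: passes the term as a CharSequence, with no intermediate String. */
  static String viaCharSequence(Term t) {
    return new StringBuilder().append((CharSequence) t).toString();
  }

  /** Variant 3: copies the backing char[] range directly. */
  static String viaBuffer(Term t) {
    return new StringBuilder().append(t.buffer(), 0, t.length()).toString();
  }

  public static void main(String[] args) {
    Term t = new Term("shingle");
    System.out.println(viaToString(t));
    System.out.println(viaCharSequence(t));
    System.out.println(viaBuffer(t));
  }
}
```

All three produce the same text. They differ only in how the characters reach the builder, which is where the allocation and copying cost comes from. The final `toString()` in each method is there only to make the results comparable.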


[jira] Updated: (LUCENE-2400) ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2400:
--------------------------------

    Attachment: LUCENE-2400.patch

Patch implementing the above-described changes, along with tests confirming that all-filler shingles/unigrams are no longer output.  A new term attribute called FillerAttribute is defined to mark whether enqueued terms are filler terms.

Unfortunately, these changes cause a roughly 22% slowdown - contrib/benchmark numbers for the shingle alg (I got similar numbers for Java 1.5):

JAVA:
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)

OS:
cygwin
WinVistaService Pack 2
Service Pack 26060022202561

||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement||
|2|no|3.04s|3.33s|2.05s|-22.6%|
|2|yes|3.23s|3.49s|2.05s|-18.0%|
|4|no|4.00s|4.56s|2.05s|-22.2%|
|4|yes|4.13s|4.72s|2.05s|-22.0%|
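
The post does not define the Improvement column. The figures are consistent, to within about 0.2 percentage points, with subtracting the StandardAnalyzer time from both timings and taking the relative change of what remains. This is inferred from the numbers rather than stated anywhere in the thread, and the helper name below is hypothetical.

```java
final class ImprovementSketch {
  /**
   * Relative change in the time attributable to ShingleFilter alone, in percent.
   * Negative values mean the patched run is slower.
   */
  static double improvementPercent(double unpatched, double patched, double baseline) {
    double before = unpatched - baseline; // ShingleFilter share, unpatched
    double after = patched - baseline;    // ShingleFilter share, patched
    return 100.0 * (before - after) / after;
  }

  public static void main(String[] args) {
    // Rows from the table above: unpatched, patched, StandardAnalyzer (seconds).
    double[][] rows = {
      {3.04, 3.33, 2.05},
      {3.23, 3.49, 2.05},
      {4.00, 4.56, 2.05},
      {4.13, 4.72, 2.05},
    };
    for (double[] r : rows) {
      System.out.printf("%.2f%%%n", improvementPercent(r[0], r[1], r[2]));
    }
  }
}
```

The small differences from the table are presumably because the displayed times are rounded to two decimals.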



[jira] Commented: (LUCENE-2400) ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858312#action_12858312 ] 

Steven Rowe commented on LUCENE-2400:
-------------------------------------

bq. I would recommend gramBuilder.append(termAtt.buffer(), 0, termAtt.length()) like before; maybe it's just the extra GC cost of creating useless strings?

I'll give this a try.




[jira] Commented: (LUCENE-2400) ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858306#action_12858306 ] 

Robert Muir commented on LUCENE-2400:
-------------------------------------

bq. Unfortunately, these changes cause a roughly 22% slowdown - contrib/benchmark numbers for the shingle alg (I got similar numbers for Java 1.5):

Steven, I wonder if this is because of a stupid thing. I noticed this in your patch:
{noformat}
-      shingleBuilder.append(termAtt.termBuffer(), 0, termAtt.termLength());
+      gramBuilder.append(charTermAtt.toString());
{noformat}

I would recommend gramBuilder.append(termAtt.buffer(), 0, termAtt.length()) like before; maybe it's just the extra GC cost of creating useless strings?


[jira] Commented: (LUCENE-2400) ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858343#action_12858343 ] 

Uwe Schindler commented on LUCENE-2400:
---------------------------------------

The improvements are in LUCENE-2401.


[jira] Updated: (LUCENE-2400) ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2400:
--------------------------------

    Attachment: LUCENE-2400.patch

Robert's change cut the performance penalty in half:

JAVA:
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)

OS:
cygwin
WinVistaService Pack 2
Service Pack 26060022202561

||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement||
|2|no|3.15s|3.25s|2.08s|-8.4%|
|2|yes|3.29s|3.42s|2.08s|-9.6%|
|4|no|4.07s|4.39s|2.08s|-13.8%|
|4|yes|4.12s|4.54s|2.08s|-17.0%|



[jira] Commented: (LUCENE-2400) ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858353#action_12858353 ] 

Steven Rowe commented on LUCENE-2400:
-------------------------------------

Uwe told me on #lucene-dev that unless the specialized CharTermAttribute methods are added to the interface, they won't get invoked. Since I didn't add them there, the numbers in my previous post are meaningless.

So, I applied LUCENE-2401 to add the correct form of the specializations, then re-ran the shingle alg, and it looks like there is no longer a penalty for using the shorthand form Uwe suggested.  Here are the numbers:

JAVA:
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)

OS:
cygwin
WinVistaService Pack 2
Service Pack 26060022202561

||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement||
|2|no|3.21s|3.31s|2.12s|-8.3%|
|2|yes|3.40s|3.54s|2.12s|-9.8%|
|4|no|4.17s|4.57s|2.12s|-16.2%|
|4|yes|4.33s|4.75s|2.12s|-15.9%|

