Posted to dev@lucene.apache.org by "Jason Rutherglen (JIRA)" <ji...@apache.org> on 2009/09/17 23:17:57 UTC

[jira] Created: (LUCENE-1917) ShingleFilter include words

ShingleFilter include words
---------------------------

                 Key: LUCENE-1917
                 URL: https://issues.apache.org/jira/browse/LUCENE-1917
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/analyzers
    Affects Versions: 2.9
            Reporter: Jason Rutherglen
            Priority: Minor
             Fix For: 3.0


By default ShingleFilter creates shingles (i.e. combines tokens
into a single token) from all tokens. To index, for example, only
stop words as shingles rather than creating shingles out of every
word, we can supply an include-words CharArraySet to ShingleFilter
that determines which tokens to shingle.

This is similar to Nutch CommonGrams and SOLR-908. SOLR-908
does not use the new token attribute API, and I figured this
functionality is better suited to being part of Lucene.
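
To make the idea concrete, here is a minimal sketch in plain Java with no
Lucene dependency. It is not the proposed patch: the class name
IncludeWordShingles, the method shingle, the single-space separator, and
the rule "shingle an adjacent pair when either token is in the include
set" (loosely modelled on CommonGrams) are all assumptions made for
illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not Lucene's ShingleFilter API.
class IncludeWordShingles {

    /**
     * Emits every token unchanged and, for each adjacent pair in which at
     * least one token is in includeWords, also emits a two-token shingle.
     * The pairing rule and the " " separator are assumptions.
     */
    static List<String> shingle(List<String> tokens, Set<String> includeWords) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            out.add(current); // the unigram is always kept
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                // Shingle only when an include word takes part in the pair.
                if (includeWords.contains(current) || includeWords.contains(next)) {
                    out.add(current + " " + next);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "quick", "brown", "fox");
        Set<String> includeWords = new HashSet<>(Arrays.asList("the"));
        // prints [the, the quick, quick, brown, fox]
        System.out.println(shingle(tokens, includeWords));
    }
}
```

With an include set that contains every token, the sketch produces a
bigram for every adjacent pair; with only stop words in the set, the
number of extra terms stays small.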

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1917) ShingleFilter include words

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1917:
-------------------------------------

    Fix Version/s:     (was: 3.0)
                   3.1

Moving out of 3.0

> ShingleFilter include words
> ---------------------------
>
>                 Key: LUCENE-1917
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1917
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 3.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h



[jira] Commented: (LUCENE-1917) ShingleFilter include words

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758056#action_12758056 ] 

Jason Rutherglen commented on LUCENE-1917:
------------------------------------------

I'm going to port SOLR-908 rather than reuse ShingleFilter, as SF seems to be built tightly for its use case.




[jira] Commented: (LUCENE-1917) ShingleFilter include words

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775019#action_12775019 ] 

Robert Muir commented on LUCENE-1917:
-------------------------------------

> I'm going to port SOLR-908 rather than reuse ShingleFilter, as SF seems to be built tightly for its use case.

Jason, is this still your plan? Can we move this out of 3.0 for now?

