Posted to dev@lucene.apache.org by "Michael Semb Wever (JIRA)" <ji...@apache.org> on 2008/09/10 12:58:44 UTC

[jira] Created: (LUCENE-1380) Patch for ShingleFilter.coterminalPositionIncrement

Patch for ShingleFilter.coterminalPositionIncrement
---------------------------------------------------

                 Key: LUCENE-1380
                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/analyzers
            Reporter: Michael Semb Wever
             Fix For: 2.4


Make it possible for *all* words and shingles to be placed at the same position.

Default is to place each shingle at the same position as the unigram (or first shingle if outputUnigrams=false). That is, each coterminal token has positionIncrement=1 and every other token a positionIncrement=0. 
This leads to a MultiPhraseQuery where at least one word/shingle must be matched from each word/token. This is not always desired. 

See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.
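
The default placement described above can be modelled without Lucene. The sketch below is a toy stand-in for ShingleFilter, not its real API: the class and method names are hypothetical, the sample phrase "abcd efgh ijkl" is the one used elsewhere in this thread, and outputUnigrams is assumed to be true. Each unigram gets positionIncrement=1, and every shingle starting at that word gets positionIncrement=0.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of default shingle positions; NOT the Lucene ShingleFilter API. */
final class ShinglePositionSketch {

    /** A term plus its position increment relative to the previous token. */
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "/+" + posInc;
        }
    }

    /** Emit each unigram with increment 1, then the shingles starting at it with increment 0. */
    static List<Tok> shingle(String[] words, int maxShingleSize) {
        List<Tok> out = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder(words[start]);
            out.add(new Tok(words[start], 1));
            for (int len = 2; len <= maxShingleSize && start + len <= words.length; len++) {
                sb.append(' ').append(words[start + len - 1]);
                out.add(new Tok(sb.toString(), 0));
            }
        }
        return out;
    }

    /** Group terms that share a position: an increment above 0 opens a new group. */
    static List<List<String>> groupByPosition(List<Tok> toks) {
        List<List<String>> groups = new ArrayList<>();
        for (Tok t : toks) {
            if (groups.isEmpty() || t.posInc > 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(t.term);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Tok> toks = shingle(new String[] {"abcd", "efgh", "ijkl"}, 3);
        System.out.println(groupByPosition(toks));
    }
}
```

Each inner list is one position. A phrase query built from this stream needs a match from every position, which is the behaviour the issue wants to make optional.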

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Assigned: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned LUCENE-1380:
---------------------------------------

    Assignee: Grant Ingersoll

> Patch for ShingleFilter.enablePositions (or PositionFilter)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Assignee: Grant Ingersoll
>            Priority: Trivial
>             Fix For: 2.4.1
>
>         Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.coterminalPositionIncrement

Posted by "Michael Semb Wever (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629846#action_12629846 ] 

Michael Semb Wever commented on LUCENE-1380:
--------------------------------------------

I suspected as much regarding the option name, but "coterminal" is a word I haven't used since high school.

> I'm -1 on the patch in its current form. If rewritten to modify the position increment only for those shingles that begin at the same word, I'd be +1 (assuming it works and is tested appropriately).

As I said in the thread, your suggestion does not work.
Setting each shingle to have positionIncrement=1, so as to avoid the MultiPhraseQuery in favour of the plain PhraseQuery, makes sense but does not work. And not phrasing the query doesn't invoke the ShingleFilter properly.
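
The link between position increments and the query type can be sketched as follows. This is an assumption drawn from the thread (tokens stacked at one position lead to a MultiPhraseQuery, strictly increasing positions to a plain PhraseQuery), not the query parser's actual code; the class and method names are hypothetical.

```java
/** Toy decision rule only; the real query parser logic is assumed, not reproduced. */
final class PhraseQueryChoiceSketch {

    /** Name the query type a phrase would plausibly become, given its tokens' position increments. */
    static String queryTypeFor(int[] posIncs) {
        // An increment of 0 after the first token stacks that token on the
        // previous position, so the position holds alternatives.
        for (int i = 1; i < posIncs.length; i++) {
            if (posIncs[i] == 0) {
                return "MultiPhraseQuery";
            }
        }
        return "PhraseQuery";
    }

    public static void main(String[] args) {
        // Default shingle output for a three-word phrase: unigram +1, its shingles +0.
        System.out.println(queryTypeFor(new int[] {1, 0, 0, 1, 0, 1}));
        // Every token advanced by one position.
        System.out.println(queryTypeFor(new int[] {1, 1, 1, 1, 1, 1}));
    }
}
```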

> The ShingleFilter appears to only work, at least for me, on phrases.
> I would think this correct as each shingle is in fact a sub-phrase to the larger original phrase.

If this is the case, i.e. ShingleFilter works on phrases as a whole entity, and the shingles from each term in the phrase are related because they all come from the one phrase, then does it not make sense to be able to position them all together?

For example, in the current implementation, in the phrase "abcd efgh ijkl" it is the first term "abcd" that is responsible for generating the shingles "abcd efgh ijkl" and "abcd efgh".
What says that these shingles couldn't be generated from the "efgh" (or "ijkl" for the former shingle) term in an alternative implementation?
Why the presumption that it's in the user's interest to force this separation between where this implementation chooses to put its shingles?

If this isn't lost-in-the-bush-logic, have you a suggestion for a more appropriate option name for the current solution?

> Patch for ShingleFilter.coterminalPositionIncrement
> ---------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>             Fix For: 2.4
>
>         Attachments: LUCENE-1380.patch


[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Michael Semb Wever (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630772#action_12630772 ] 

Michael Semb Wever commented on LUCENE-1380:
--------------------------------------------

Ok. So there's no way to do it through configuration only.
Would a patch with such a TokenFilter be useful for anybody other than ShingleFilter users? Again, I'm a newbie here, but I suspect there's no other filter (yet) that works _across_ the tokens (and hence breaks down the importance of positionIncrement) within a query the way ShingleFilter does. For example, from Steve on the mailing list:
> On the other hand, I'm not sure how useful position information is for shingles in the general case: they already have relative position info 
> embedded within them.  And how likely is it that one would want to perform a phrase/span query over shingles?  Pretty unlikely, ...

> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>            Assignee: Karl Wettin
>             Fix For: 2.4
>
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch


[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mck SembWever updated LUCENE-1380:
----------------------------------

      Description: 
Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.

Today the shingles generated are synonyms only to the first term in the shingle.
For example the query "abcd efgh ijkl" results in:
   ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")

where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".

There exists no way today to alter which token a particular shingle is a synonym for.
This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.

See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.

  was:
Make it possible for *all* words and shingles to be placed at the same position, that is to _all_ be treated as synonyms of each other.

Today the shingles generated are synonyms only to the first term in the shingle.
For example the query "abcd efgh ijkl" results in:
   ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")

where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".

There exists no way today to alter which token a particular shingle is a synonym for.
This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.

See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.

    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Fixed a typo while editing the description.

> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Assignee: Karl Wettin
>            Priority: Trivial
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch


[jira] Issue Comment Edited: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Michael Semb Wever (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630885#action_12630885 ] 

michaelsembwever edited comment on LUCENE-1380 at 9/14/08 5:58 AM:
---------------------------------------------------------------------

> All this patch does is to set all position increment of the tokens produced by the ShingleFilter to 0, right? 
> I'm going to remove this for 2.4 fix and recommend you to use the filter strategy mentioned. 

The patch to add the new TokenFilter isn't as easy as ABC: Lucene needs the filter class added to the classpath, and Solr needs the TokenFilterFactory added before it can read the filter from the configuration files. That is a lot of work when we're (almost) agreed that removing positional information from all tokens makes sense when using the ShingleFilter.

If it were just the one installation I wouldn't have a problem with adding the custom TokenFilter, but because our use case is an open-source, documented system (read http://sesat.no/howto-solr-query-evaluation.html) I'd like to make it as easy as possible for third parties.

I would also think that, because this is a way to replace commercial and competing technology from FAST, the community would be behind such an enhancement...

      was (Author: michaelsembwever):
    > All this patch does is to set all position increment of the tokens produced by the ShingleFilter to 0, right? 
> I'm going to remove this for 2.4 fix and recommend you to use the filter strategy mentioned. 

The patch to add the new TokenFilter isn't easy-as-abc as lucene needs to have the filter class added to classpath, and Solr needs the TokenFilterFactory added to be able to read it from the configuration files. A lot of work when we're (almost) agreed that removing positional information from all tokens makes sense when using the ShingleFilter.

If it were just the one installation i wouldn't have a problem with adding the custom TokenFilter, but because our use-case is an open sourced and documented system ( read http://sesat.no/howto-solr-query-evaluation.html ) i'd like to make it as easy as possible for third parties.

I would also think that this is a way to replace commercial and competing technology from FAST that the community would be behind such an enhancement...
  
> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>            Assignee: Karl Wettin
>            Priority: Trivial
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch


[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634408#action_12634408 ] 

Mck SembWever commented on LUCENE-1380:
---------------------------------------

> Take a look and make sure things are as they should be - the tests pass for me, and I think it's doing what it should do.

The tests run, and the code works in my use case. Thanks, Steve.

> Patch for ShingleFilter.enablePositions (or PositionFilter)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Priority: Trivial
>         Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch


[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-1380:
--------------------------------

         Priority: Trivial  (was: Major)
    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
    Fix Version/s:     (was: 2.4)

> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>            Assignee: Karl Wettin
>            Priority: Trivial
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch


[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated LUCENE-1380:
-------------------------------------

    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
    Fix Version/s: 2.4.1

> Patch for ShingleFilter.enablePositions (or PositionFilter)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Priority: Trivial
>             Fix For: 2.4.1
>
>         Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch


[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mck SembWever updated LUCENE-1380:
----------------------------------

    Attachment:     (was: LUCENE-1380.patch)

> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Assignee: Karl Wettin
>            Priority: Trivial
>         Attachments: LUCENE-1380.patch


[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630884#action_12630884 ] 

Karl Wettin commented on LUCENE-1380:
-------------------------------------

> Ok. So there's no way to do it through configuration only.

In Solr? Well, I don't really do Solr, but I'm pretty sure all you have to do is create the filter as a new class, add it to the class path, and add it as a filter to the query analyzer in your configuration.

> Would a patch with such a TokenFilter be useful for anybody else other than ShingleFilter users?

I'd say no; it only seems to make sense for shingles at query parsing time.

> Again i'm a newbie here but i suspect there's no other filter (yet) which works across the tokens (and hence breaks down the importance of positionIncrement) within a query in the way ShingleFilter does.

I don't understand what you are saying here. All this patch does is set the position increment of all tokens produced by the ShingleFilter to 0, right?

I'm going to remove this from the 2.4 fix version and recommend that you use the filter strategy mentioned. I'll leave the issue open for discussion though.
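
The filter strategy mentioned here can be pictured as a decorator over a token stream. The sketch below is a plain-Java model under stated assumptions: `Tok` and `FlattenPositions` are hypothetical names, and a real implementation would extend Lucene's TokenFilter rather than `Iterator`. Leaving the first token's increment untouched is a design choice of this sketch, not something the patch is confirmed to do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Plain-Java model of a position-flattening filter; NOT the Lucene TokenFilter API. */
final class PositionFilterSketch {

    /** A term plus its position increment relative to the previous token. */
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /** Passes tokens through unchanged except that every increment after the first becomes 0. */
    static final class FlattenPositions implements Iterator<Tok> {
        private final Iterator<Tok> input;
        private boolean first = true;

        FlattenPositions(Iterator<Tok> input) {
            this.input = input;
        }

        @Override
        public boolean hasNext() {
            return input.hasNext();
        }

        @Override
        public Tok next() {
            Tok t = input.next();
            if (first) {
                first = false;
                return t; // the first token keeps its original increment
            }
            return new Tok(t.term, 0);
        }
    }

    /** Drain a token iterator and collect the position increments it yields. */
    static List<Integer> increments(Iterator<Tok> toks) {
        List<Integer> out = new ArrayList<>();
        while (toks.hasNext()) {
            out.add(toks.next().posInc);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> shingled = Arrays.asList(
            new Tok("abcd", 1), new Tok("abcd efgh", 0),
            new Tok("efgh", 1), new Tok("efgh ijkl", 0), new Tok("ijkl", 1));
        System.out.println(increments(new FlattenPositions(shingled.iterator())));
    }
}
```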

> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>            Assignee: Karl Wettin
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch


[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654410#action_12654410 ] 

Mck SembWever commented on LUCENE-1380:
---------------------------------------

Ping. Are there any committers willing to commit these changes?

> Patch for ShingleFilter.enablePositions (or PositionFilter)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Priority: Trivial
>         Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch


[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Michael Semb Wever (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630885#action_12630885 ] 

Michael Semb Wever commented on LUCENE-1380:
--------------------------------------------

> All this patch does is to set all position increment of the tokens produced by the ShingleFilter to 0, right? 
> I'm going to remove this for 2.4 fix and recommend you to use the filter strategy mentioned. 

The patch to add the new TokenFilter isn't as easy as ABC: Lucene needs the filter class added to the classpath, and Solr needs the TokenFilterFactory added before it can read the filter from the configuration files. That is a lot of work when we're (almost) agreed that removing positional information from all tokens makes sense when using the ShingleFilter.

If it were just the one installation I wouldn't have a problem with adding the custom TokenFilter, but because our use case is an open-source, documented system (read http://sesat.no/howto-solr-query-evaluation.html) I'd like to make it as easy as possible for third parties.

I would also think that, because this is a way to replace commercial and competing technology from FAST, the community would be behind such an enhancement...

> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>            Assignee: Karl Wettin
>            Priority: Trivial
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position.
> Default is to place each shingle at the same position as the unigram (or first shingle if outputUnigrams=false). That is, each coterminal token has positionIncrement=1 and every other token a positionIncrement=0. 
> This leads to a MultiPhraseQuery where at least one word/shingle must be matched from each word/token. This is not always desired. 
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633352#action_12633352 ] 

Mck SembWever commented on LUCENE-1380:
---------------------------------------

> separate out this feature to a new filter that modify the position increment. 

As Chris explained on the list, this approach would clobber all terms into one big synonym group. There may be other terms in the query outside of the quotes which should not be treated as synonyms to the shingles. And it was also mentioned that there were known bugs when the first token had positionIncrement=0 (or all tokens lay at position zero instead of at position one).
I imagine that this rules out such a position increment TokenFilter.

> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Priority: Trivial
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.
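
To make the grouping in the description above concrete, here is a minimal, self-contained sketch that models it. It does not use Lucene's classes: `Tok`, `shingles` and `groupByPosition` are hypothetical names chosen for illustration, and the only assumption carried over from the description is that a unigram gets positionIncrement=1 while every shingle starting at the same word gets positionIncrement=0.

```java
import java.util.ArrayList;
import java.util.List;

public class ShinglePositions {

    /** Hypothetical stand-in for a token: text plus position increment (not Lucene's Token). */
    public static final class Tok {
        public final String text;
        public final int posIncr;

        public Tok(String text, int posIncr) {
            this.text = text;
            this.posIncr = posIncr;
        }
    }

    /** Emits each unigram (posIncr=1) followed by the shingles starting at it (posIncr=0). */
    public static List<Tok> shingles(String[] words, int maxSize) {
        List<Tok> out = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            for (int len = 1; len <= maxSize && start + len <= words.length; len++) {
                if (len > 1) {
                    sb.append(' ');
                }
                sb.append(words[start + len - 1]);
                out.add(new Tok(sb.toString(), len == 1 ? 1 : 0));
            }
        }
        return out;
    }

    /** Groups tokens that share a position; each group behaves like one set of synonyms. */
    public static List<List<String>> groupByPosition(List<Tok> tokens) {
        List<List<String>> groups = new ArrayList<>();
        for (Tok t : tokens) {
            // A positive increment moves to a new position, hence a new group.
            if (t.posIncr > 0 || groups.isEmpty()) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(t.text);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Tok> toks = shingles(new String[] {"abcd", "efgh", "ijkl"}, 3);
        // prints [[abcd, abcd efgh, abcd efgh ijkl], [efgh, efgh ijkl], [ijkl]]
        System.out.println(groupByPosition(toks));
    }
}
```

If every token after the first were given positionIncrement=0 instead, `groupByPosition` would return a single group, which is the "all synonyms of each other" behaviour this issue asks for.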



[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633355#action_12633355 ] 

Steven Rowe commented on LUCENE-1380:
-------------------------------------

{quote}
bq. separate out this feature to a new filter that modify the position increment. 

There may be other terms in the query outside of the quotes which should not be treated as synonyms to the shingles.
{quote}

but they won't be in the same field, right?  Solr has per-field analysis facilities.

bq. And it was also mentioned that there were known bugs when the first token had positionIncrement=0 (or all tokens lay at position zero instead of at position one).

You can tell the filter to set posincr=1 for the first token.

When it receives null from its predecessor in the filter chain, it can reset its "at the beginning" flag, and the next time it's used, it'll give posincr=1 for the first token again.

> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Priority: Trivial
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630700#action_12630700 ] 

Karl Wettin commented on LUCENE-1380:
-------------------------------------

One could argue that what you should do rather than using this patch is to add a TokenFilter that sets all positionIncrement to 0.

> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>            Assignee: Karl Wettin
>             Fix For: 2.4
>
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position.
> Default is to place each shingle at the same position as the unigram (or first shingle if outputUnigrams=false). That is, each coterminal token has positionIncrement=1 and every other token a positionIncrement=0. 
> This leads to a MultiPhraseQuery where at least one word/shingle must be matched from each word/token. This is not always desired. 
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mck SembWever updated LUCENE-1380:
----------------------------------

    Attachment: LUCENE-1380-PositionFilter.patch

> If you really want to make this change in this layer, I suggest that you separate out this feature into a new filter that modifies
> the position increment.

Attaching alternative patch as suggested for PositionFilter and its test.
The first token always maintains its original positionIncrement, but subsequent tokens in the TokenStream have their positionIncrement set to match the value of PositionFilter.positionIncrement.
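
As a rough illustration of the behaviour described above (and not the attached patch itself), the following self-contained sketch models such a filter without depending on Lucene. `Tok`, `TokenSource`, `source` and `increments` are hypothetical helpers invented for this example; the reset of the "first token" flag when the input returns null follows the suggestion made earlier in this thread and is an assumption about how the flag would be handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PositionFilterSketch {

    /** Hypothetical stand-in for a token: text plus position increment. */
    public static final class Tok {
        public final String text;
        public final int posIncr;

        public Tok(String text, int posIncr) {
            this.text = text;
            this.posIncr = posIncr;
        }
    }

    /** Hypothetical pull-style stream: next() returns null at end of stream. */
    public interface TokenSource {
        Tok next();
    }

    /** Keeps the first token's increment; rewrites all later increments. */
    public static final class PositionFilter implements TokenSource {
        private final TokenSource input;
        private final int positionIncrement;
        private boolean firstTokenSeen = false;

        public PositionFilter(TokenSource input, int positionIncrement) {
            this.input = input;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public Tok next() {
            Tok t = input.next();
            if (t == null) {
                // End of stream: forget state so a later pass starts fresh.
                firstTokenSeen = false;
                return null;
            }
            if (!firstTokenSeen) {
                firstTokenSeen = true;
                return t; // first token keeps its original increment
            }
            return new Tok(t.text, positionIncrement);
        }
    }

    /** Wraps a fixed list of tokens as a TokenSource. */
    public static TokenSource source(Tok... toks) {
        Iterator<Tok> it = Arrays.asList(toks).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    /** Drains a source and collects the position increments it produced. */
    public static List<Integer> increments(TokenSource s) {
        List<Integer> out = new ArrayList<>();
        for (Tok t = s.next(); t != null; t = s.next()) {
            out.add(t.posIncr);
        }
        return out;
    }

    public static void main(String[] args) {
        TokenSource in = source(new Tok("abcd", 1), new Tok("abcd efgh", 0), new Tok("efgh", 1));
        // prints [1, 0, 0]
        System.out.println(increments(new PositionFilter(in, 0)));
    }
}
```

With a configured increment of 0, everything after the first token lands on the first token's position, so the shingles and unigrams end up in one synonym group.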

I still fail to understand why Karl and Steve would rather see this logic in the QueryParser. The best explanation so far was from Steve:
> IMO, the correct layer to solve this is in Solr's QParser - 
> I think there should be a way to tell the parser not to parse, but rather to send the whole query to be analyzed.

but I wouldn't be surprised if this goes against the grain of how Solr works.


> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Priority: Trivial
>         Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-1380:
--------------------------------

    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
         Assignee:     (was: Karl Wettin)

I'm unassigning myself from this issue as there are so many votes and I consider it a hack to add a change whose sole purpose is to change the behavior of a query parser, and I don't think such a thing should be committed. I think the focus should be on the query parser, and I understand that is a lot more work than modifying the shingle filter. If you really want to make this change in this layer, I suggest that you separate out this feature into a new filter that modifies the position increment.

> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Priority: Trivial
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Resolved: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved LUCENE-1380.
-------------------------------------

       Resolution: Fixed
    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Committed revision 725691.

> Patch for ShingleFilter.enablePositions (or PositionFilter)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Assignee: Grant Ingersoll
>            Priority: Trivial
>             Fix For: 2.4.1
>
>         Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655633#action_12655633 ] 

Grant Ingersoll commented on LUCENE-1380:
-----------------------------------------

Just to be clear, Mck, what changes are you asking about?  The position filter one or the broader Shingle one?

If I'm reading the thread correctly, I think everyone settled on just going w/ the position filter changes, right?

> Patch for ShingleFilter.enablePositions (or PositionFilter)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Assignee: Grant Ingersoll
>            Priority: Trivial
>             Fix For: 2.4.1
>
>         Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mck SembWever updated LUCENE-1380:
----------------------------------

    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
          Summary: Patch for ShingleFilter.enablePositions (or PositionFilter)  (was: Patch for ShingleFilter.enablePositions)

> Patch for ShingleFilter.enablePositions (or PositionFilter)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Priority: Trivial
>         Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Issue Comment Edited: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633355#action_12633355 ] 

steve_rowe edited comment on LUCENE-1380 at 9/22/08 8:56 AM:
--------------------------------------------------------------

{quote}
bq. separate out this feature to a new filter that modify the position increment. 

There may be other terms in the query outside of the quotes which should not be treated as synonyms to the shingles.
{quote}

but they won't be in the same field, right?  Solr has per-field analysis facilities.

bq. And it was also mentioned that there were known bugs when the first token had positionIncrement=0 (or all tokens lay at position zero instead of at position one).

You can tell the filter to set posincr=1 for the first token.

When it receives null from its predecessor in the filter chain, it can reset its "at the beginning" flag, and the next time it's used, it'll give posincr=1 for the first token again.

      was (Author: steve_rowe):
    {qoute}
bq. separate out this feature to a new filter that modify the position increment. 

There may be other terms in the query outside of the quotes which should not be treated as synonyms to the shingles.
{quote}

but they won't be in the same field, right?  Solr has per-field analysis facilities.

bq. And it was also mentioned that there were known bugs when the first token had positionIncrement=0 (or all tokens lay at position zero instead of at position one).

You can tell the filter to set posincr=1 for the first token.

When it receives null from its predecessor in the filter chain, it can reset its "at the beginning" flag, and the next time it's used, it'll give posincr=1 for the first token again.
  
> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Priority: Trivial
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1380:
---------------------------------------

    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
    Fix Version/s:     (was: 2.4.1)
                   2.9

> Patch for ShingleFilter.enablePositions (or PositionFilter)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Assignee: Grant Ingersoll
>            Priority: Trivial
>             Fix For: 2.9
>
>         Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655637#action_12655637 ] 

Mck SembWever commented on LUCENE-1380:
---------------------------------------

Yes, we agreed on the PositionFilter approach. 
It works well (and is in production at http://sesam.no) 
and steers clear of having to decide whether ShingleFilter, solely by itself, was intended to be used in such a manner and hence whether such positioning functionality should be encapsulated there.


> Patch for ShingleFilter.enablePositions (or PositionFilter)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Assignee: Grant Ingersoll
>            Priority: Trivial
>             Fix For: 2.4.1
>
>         Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.coterminalPositionIncrement

Posted by "Michael Semb Wever (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated LUCENE-1380:
---------------------------------------

    Attachment: LUCENE-1380.patch

Addition to ShingleFilter for property coterminalPositionIncrement.
New corresponding test in ShingleFilterTest.

> Patch for ShingleFilter.coterminalPositionIncrement
> ---------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>             Fix For: 2.4
>
>         Attachments: LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position.
> Default is to place each shingle at the same position as the unigram (or first shingle if outputUnigrams=false). That is, each coterminal token has positionIncrement=1 and every other token a positionIncrement=0. 
> This leads to a MultiPhraseQuery where at least one word/shingle must be matched from each word/token. This is not always desired. 
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mck SembWever updated LUCENE-1380:
----------------------------------

    Attachment: LUCENE-1380-PositionFilter.patch

Re-attached the PositionFilter patch addressing Steve's moderation comments. (2)
Steve, can you look at the difference between handling reset() and handling a null token in the stream? Are both approaches valid to test? (I had not overridden TokenStream.reset() in the previous patch.)

> Patch for ShingleFilter.enablePositions (or PositionFilter)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Priority: Trivial
>         Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630768#action_12630768 ] 

Karl Wettin commented on LUCENE-1380:
-------------------------------------

>> One could argue that what you should do rather than using this patch is to add a TokenFilter that sets all positionIncrement to 0.
>Really? You'll have to excuse me - i am very new to Lucene.
>How would i go about that? Such a TokenFilter exists already?

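{code:java}
// Wraps the input stream and puts every token at the same position as its predecessor.
new TokenFilter(input) {
  public Token next(Token reusableToken) throws IOException {
    Token nextToken = input.next(reusableToken);
    if (nextToken == null) {
      return null; // end of stream: nothing left to reposition
    }
    nextToken.setPositionIncrement(0);
    return nextToken;
  }
};
{code}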

> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>            Assignee: Karl Wettin
>             Fix For: 2.4
>
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position.
> Default is to place each shingle at the same position as the unigram (or first shingle if outputUnigrams=false). That is, each coterminal token has positionIncrement=1 and every other token a positionIncrement=0. 
> This leads to a MultiPhraseQuery where at least one word/shingle must be matched from each word/token. This is not always desired. 
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mck SembWever updated LUCENE-1380:
----------------------------------

      Description: 
Make it possible for *all* words and shingles to be placed at the same position, that is to _all_ be treated as synonyms of each other.

Today the shingles generated are synonyms only to the first term in the shingle.
For example the query "abcd efgh ijkl" results in:
   ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")

where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".

There exists no way today to alter which token a particular shingle is a synonym for.
This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.

See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.

  was:
Make it possible for *all* words and shingles to be placed at the same position.

Default is to place each shingle at the same position as the unigram (or first shingle if outputUnigrams=false). That is, each coterminal token has positionIncrement=1 and every other token a positionIncrement=0. 
This leads to a MultiPhraseQuery where at least one word/shingle must be matched from each word/token. This is not always desired. 

See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.

    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Updated the description to explain the problem in more layman's terms.
Maybe the option should be called "commonSynonyms" or the like...
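
To make the updated description concrete, here is a minimal, self-contained sketch of the two positioning schemes. It does not use the Lucene API: SketchToken and ShinglePositionSketch are hypothetical names invented for illustration, and the enablePositions flag only mimics the option name used in this issue. It assumes the first token of the stream always gets positionIncrement=1.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a Lucene token: a term plus a position increment. */
final class SketchToken {
  final String term;
  final int positionIncrement;

  SketchToken(String term, int positionIncrement) {
    this.term = term;
    this.positionIncrement = positionIncrement;
  }

  @Override
  public String toString() {
    return term + "(+" + positionIncrement + ")";
  }
}

final class ShinglePositionSketch {

  /**
   * Emits unigrams and shingles of up to maxShingleSize words.
   * enablePositions=true: the first token starting at each word advances the
   * position by 1, and the other tokens starting at that word get increment 0.
   * enablePositions=false: only the very first token advances the position,
   * so every token ends up at the same position.
   */
  static List<SketchToken> shingles(String[] words, int maxShingleSize, boolean enablePositions) {
    List<SketchToken> out = new ArrayList<SketchToken>();
    for (int start = 0; start < words.length; start++) {
      StringBuilder term = new StringBuilder();
      for (int size = 1; size <= maxShingleSize && start + size <= words.length; size++) {
        if (size > 1) {
          term.append(' ');
        }
        term.append(words[start + size - 1]);
        boolean firstOverall = out.isEmpty();
        boolean firstAtThisWord = (size == 1);
        int increment = (firstOverall || (enablePositions && firstAtThisWord)) ? 1 : 0;
        out.add(new SketchToken(term.toString(), increment));
      }
    }
    return out;
  }
}
```

For the words abcd, efgh, ijkl and a maximum shingle size of 3, enablePositions=true gives increments 1,0,0,1,0,1 (three positions, as in the example above), while enablePositions=false gives 1,0,0,0,0,0 (one position shared by all six tokens).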

> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Assignee: Karl Wettin
>            Priority: Trivial
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is to _all_ be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634191#action_12634191 ] 

Steven Rowe commented on LUCENE-1380:
-------------------------------------

When I wrote:
bq. 4.  You should provide a standalone test for the PositionFilter, in addition to the ShingleFilterTest tests.

I meant that testing of PositionFilter should be separate from testing its functionality with ShingleFilter.  Your PositionFilter test looks at offsets, which PositionFilter doesn't affect at all.  It is possible that PositionFilter will be used for other things than ShingleFilter.  Hence, there should be basic test(s) that evaluate PositionFilter without ShingleFilter.

I also think a test to make sure a single instance of PositionFilter will work with multiple documents should be added.

BTW, you don't need to delete JIRA attachments if you want to upload a new version - when you upload a same-named file, the most recent version of the file will be colored black, and older versions will be colored gray.  This is the conventional way Lucene uses JIRA.  It allows people to follow the JIRA comments in the progressive versions of the patch(es).

A typo on line 66 of PositionFilterTest: 
{code:java}
            // end of stream so reset firstTokePositioned
{code}


> Patch for ShingleFilter.enablePositions (or PositionFilter)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Priority: Trivial
>         Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Assigned: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin reassigned LUCENE-1380:
-----------------------------------

    Assignee: Karl Wettin

> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>            Assignee: Karl Wettin
>             Fix For: 2.4
>
>         Attachments: LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position.
> Default is to place each shingle at the same position as the unigram (or first shingle if outputUnigrams=false). That is, each coterminal token has positionIncrement=1 and every other token a positionIncrement=0. 
> This leads to a MultiPhraseQuery where at least one word/shingle must be matched from each word/token. This is not always desired. 
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.coterminalPositionIncrement

Posted by "Michael Semb Wever (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated LUCENE-1380:
---------------------------------------

    Attachment:     (was: LUCENE-1380.patch)

> Patch for ShingleFilter.coterminalPositionIncrement
> ---------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>             Fix For: 2.4
>
>         Attachments: LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position.
> Default is to place each shingle at the same position as the unigram (or first shingle if outputUnigrams=false). That is, each coterminal token has positionIncrement=1 and every other token a positionIncrement=0. 
> This leads to a MultiPhraseQuery where at least one word/shingle must be matched from each word/token. This is not always desired. 
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-1380:
--------------------------------

    Attachment: LUCENE-1380.patch

Renamed field to usingPositionIncrement to avoid confusion, and added a bunch of javadocs compiled from the issue comments:

{code:java}
/**
   * If true each original token (unigram) or the first related shingle from it
   * will get a {@link org.apache.lucene.analysis.Token#getPositionIncrement() positionIncrement} of 1,
   * if false all shingle tokens will get a {@link org.apache.lucene.analysis.Token#getPositionIncrement() positionIncrement} of 0.
   * <p>
   * Default value is true.
   * <p>
   * This attribute is typically set false in conjunction with use of the QueryParser that
   * when set true will create a MultiPhraseQuery where at least one word/shingle must be
   * matched from each word/token, not desired in all situations. Setting this to false
   * will instead create a PhraseQuery.
   *
   * @param usingPositionIncrement the coterminal token positionIncrement setting.
   */
  public void setUsingPositionIncrement(boolean usingPositionIncrement){
      this.usingPositionIncrement = usingPositionIncrement;
  }
{code}

Did I get that right?

Steve, are you still -1? I don't see any harm in this patch.

> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>            Assignee: Karl Wettin
>             Fix For: 2.4
>
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position.
> Default is to place each shingle at the same position as the unigram (or first shingle if outputUnigrams=false). That is, each coterminal token has positionIncrement=1 and every other token a positionIncrement=0. 
> This leads to a MultiPhraseQuery where at least one word/shingle must be matched from each word/token. This is not always desired. 
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Michael Semb Wever (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated LUCENE-1380:
---------------------------------------

    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
          Summary: Patch for ShingleFilter.enablePositions  (was: Patch for ShingleFilter.coterminalPositionIncrement)

> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>             Fix For: 2.4
>
>         Attachments: LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position.
> Default is to place each shingle at the same position as the unigram (or first shingle if outputUnigrams=false). That is, each coterminal token has positionIncrement=1 and every other token a positionIncrement=0. 
> This leads to a MultiPhraseQuery where at least one word/shingle must be matched from each word/token. This is not always desired. 
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.coterminalPositionIncrement

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629827#action_12629827 ] 

Steven Rowe commented on LUCENE-1380:
-------------------------------------

As I said in the thread on java-user that spawned this issue: <http://www.nabble.com/Replacing-FAST-functionality-at-sesam.no---ShingleFilter%2B-exact-matching-td19396291.html> (emphasis added):

{quote}
It works because you've set all of the shingles to be at the same position - probably better to change the one instance of .setPositionIncrement(0) to .setPositionIncrement(1) - that way, MultiPhraseQuery will not be invoked, and the standard disjunction thing should happen.

> [W]ould a patch to ShingleFilter that offers an option
> "unigramPositionIncrement" (that defaults to 1) likely be
> accepted into trunk?

The issue is not directly related to whether a unigram is involved, but rather whether or not _*tokens that begin at the same word*_ are given the same position.  The option thus should be named something like "coterminalPositionIncrement".  This seems like a reasonable addition, and a patch likely would be accepted, if it included unit tests.
{quote}

You have used the option name I suggested, but have implemented it in a form that doesn't follow the name -- in your implementation, *all* tokens are placed at the same position, not just those that start at the same word -- and I think this form is inappropriate for the general user.

I'm -1 on the patch in its current form.  If rewritten to modify the position increment only for those shingles that begin at the same word, I'd be +1 (assuming it works and is tested appropriately).

> Patch for ShingleFilter.coterminalPositionIncrement
> ---------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>             Fix For: 2.4
>
>         Attachments: LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position.
> Default is to place each shingle at the same position as the unigram (or first shingle if outputUnigrams=false). That is, each coterminal token has positionIncrement=1 and every other token a positionIncrement=0. 
> This leads to a MultiPhraseQuery where at least one word/shingle must be matched from each word/token. This is not always desired. 
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1380:
--------------------------------

    Attachment: LUCENE-1380-PositionFilter.patch

Mck, I was wrong about Filter testing over multiple docs - each instance of a Filter is defined only over a single doc, so this doesn't make sense.

However, you are completely on the right track with the reset() operation, since PositionFilter is sensitive to whether it's at the beginning of a stream, and it should respond as you have written it.

So, since I was wrong about PositionFilter needing to handle usage with multiple documents, the else clause that I said should go in (upon receiving null from the input stream) should come back out.  In fact, the proper response from a filter in the analysis chain upon encountering null is to stop processing, since it means end-of-stream, so I've removed your tests with null embedded in this revised patch.

bq. Steve, can you look at the reset versus null-token-in-stream difference? Are both approaches valid to test? (I'd not overridden TokenStream.reset() in the previous patch.)

I removed the void-return filterTest(), since it wasn't called from anywhere, and it only used ShingleFilter, and no PositionFilter.  In its place I've added another test named testReset().

I added a test that checks for non-default positionIncrement: testNonZeroPositionIncrement().

I removed PositionFilter.setPositionIncrement(), because using it one could potentially change the position increment in mid-stream, which makes little sense.  The alternate constructor provides a way to set it.
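
The behaviour discussed in this comment (stop on null, re-arm on reset(), increment fixed at construction) can be sketched without Lucene as follows. This is not the code from the attached patch: PosToken, PosTokenSource, ArrayTokenSource and PositionFilterSketch are hypothetical names, and the rule that the first token keeps its own increment while later tokens receive the configured one is an assumption based on the comments in this issue.

```java
/** Hypothetical stand-in for a Lucene token: a term plus a position increment. */
final class PosToken {
  final String term;
  int positionIncrement;

  PosToken(String term, int positionIncrement) {
    this.term = term;
    this.positionIncrement = positionIncrement;
  }
}

/** Hypothetical stand-in for a token stream; next() returns null at end of stream. */
interface PosTokenSource {
  PosToken next();

  void reset();
}

/** Emits one fresh token per term, each with position increment 1. */
final class ArrayTokenSource implements PosTokenSource {
  private final String[] terms;
  private int index = 0;

  ArrayTokenSource(String... terms) {
    this.terms = terms;
  }

  public PosToken next() {
    return index < terms.length ? new PosToken(terms[index++], 1) : null;
  }

  public void reset() {
    index = 0;
  }
}

/** Sketch of a position filter: assumed behaviour, not the patch itself. */
final class PositionFilterSketch implements PosTokenSource {
  private final PosTokenSource input;
  private final int positionIncrement; // fixed at construction, deliberately no setter
  private boolean firstTokenPositioned = false;

  PositionFilterSketch(PosTokenSource input) {
    this(input, 0);
  }

  PositionFilterSketch(PosTokenSource input, int positionIncrement) {
    this.input = input;
    this.positionIncrement = positionIncrement;
  }

  public PosToken next() {
    PosToken token = input.next();
    if (token == null) {
      return null; // end of stream: stop processing
    }
    if (firstTokenPositioned) {
      token.positionIncrement = positionIncrement;
    } else {
      firstTokenPositioned = true; // the first token keeps its own increment
    }
    return token;
  }

  public void reset() {
    input.reset();
    firstTokenPositioned = false; // the next token is the start of a stream again
  }
}
```

Because the increment can only be supplied through the constructor, it cannot change in mid-stream, and the only state that reset() has to clear is the first-token flag.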

In the patch, I have modified the formatting a little to conform to Lucene convention, which is outlined on the [HowToContribute wiki page|http://wiki.apache.org/lucene-java/HowToContribute#head-59ae13df098fbdcc46abdf980aa8ee76d3ee2e3b]:

{quote}
* Code should be formatted according to [Sun's conventions|http://java.sun.com/docs/codeconv/] with one exception:
** indent two spaces per level, not four.
{quote}

I ran "svn diff" under the trunk/ directory, instead of in trunk/contrib/analyzers/ (where you based your patches) - it's simpler for people who look at a lot of these things to have them always be based from trunk/.

Take a look and make sure things are as they should be - the tests pass for me, and I think it's doing what it should do.

If you agree, then hopefully we can get Karl (or another committer, which I'm not) to take a look and see if they think it can be committed.


> Patch for ShingleFilter.enablePositions (or PositionFilter)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Priority: Trivial
>         Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mck SembWever updated LUCENE-1380:
----------------------------------

    Attachment: LUCENE-1380-PositionFilter.patch

Re-attached the PositionFilter patch addressing Steve's moderation comments. 

> Patch for ShingleFilter.enablePositions (or PositionFilter)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Priority: Trivial
>         Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.coterminalPositionIncrement

Posted by "Michael Semb Wever (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated LUCENE-1380:
---------------------------------------

    Attachment: LUCENE-1380.patch

New version with option named enablePositions

> Patch for ShingleFilter.coterminalPositionIncrement
> ---------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>             Fix For: 2.4
>
>         Attachments: LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position.
> Default is to place each shingle at the same position as the unigram (or first shingle if outputUnigrams=false). That is, each coterminal token has positionIncrement=1 and every other token a positionIncrement=0. 
> This leads to a MultiPhraseQuery where at least one word/shingle must be matched from each word/token. This is not always desired. 
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mck SembWever updated LUCENE-1380:
----------------------------------

    Attachment: LUCENE-1380.patch

Updated version that ensures first token always has positionIncrement=1

(Karl's changes from his patch are in this patch).

> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Assignee: Karl Wettin
>            Priority: Trivial
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.enablePositions

Posted by "Michael Semb Wever (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630762#action_12630762 ] 

Michael Semb Wever commented on LUCENE-1380:
--------------------------------------------

> One could argue that what you should do rather than using this patch is to add a TokenFilter that sets all positionIncrement to 0. 

Really? You'll have to excuse me - i am very new to Lucene.
How would i go about that? Such a TokenFilter exists already?

> Setting this to false will instead create a PhraseQuery.

This isn't correct. PhraseQuery is used when every token has a non-zero positionIncrement, i.e. when severalTokensAtSamePosition == false.
What does happen is that the MultiPhraseQuery that is constructed is limited to one dimension.
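
As a rough illustration of the distinction above, the sketch below walks a list of position increments and records whether any token shares a position with its predecessor. It is a simplified model, not QueryParser code: the class PositionScanSketch is hypothetical, the field name is borrowed from the variable mentioned above, and the counting rule is an assumption that should be checked against the QueryParser source.

```java
/** Hypothetical summary of a token stream's position increments. */
final class PositionScanSketch {
  final int positionCount;
  final boolean severalTokensAtSamePosition;

  private PositionScanSketch(int positionCount, boolean severalTokensAtSamePosition) {
    this.positionCount = positionCount;
    this.severalTokensAtSamePosition = severalTokensAtSamePosition;
  }

  /**
   * Assumed rule: a non-zero increment opens new position(s);
   * a zero increment means the token shares the previous token's position.
   */
  static PositionScanSketch scan(int[] positionIncrements) {
    int count = 0;
    boolean several = false;
    for (int increment : positionIncrements) {
      if (increment != 0) {
        count += increment;
      } else {
        several = true;
      }
    }
    return new PositionScanSketch(count, several);
  }
}
```

Under this model, increments 1,1,1 give three positions with no shared position (the plain PhraseQuery case). Increments 1,0,0,1,0,1 give three positions with shared positions, and 1,0,0,0,0,0 gives shared positions collapsed into a single position, which is the one-dimensional case described above.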


> Patch for ShingleFilter.enablePositions
> ---------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>            Assignee: Karl Wettin
>             Fix For: 2.4
>
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position.
> Default is to place each shingle at the same position as the unigram (or first shingle if outputUnigrams=false). That is, each coterminal token has positionIncrement=1 and every other token a positionIncrement=0. 
> This leads to a MultiPhraseQuery where at least one word/shingle must be matched from each word/token. This is not always desired. 
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mck SembWever updated LUCENE-1380:
----------------------------------

    Attachment:     (was: LUCENE-1380-PositionFilter.patch)

> Patch for ShingleFilter.enablePositions (or PositionFilter)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Priority: Trivial
>         Attachments: LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only of the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.
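
The grouping in the description above can be reproduced with a short sketch. This is a minimal illustration that assumes a maximum shingle size of 3 and a single space as the separator; ShingleGroupingSketch and its methods are hypothetical names and do not come from the patch or from ShingleFilter.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model of shingle grouping; not the ShingleFilter implementation. */
final class ShingleGroupingSketch {

    /**
     * Default behaviour as described in the issue: each group holds the unigram
     * at a given start word plus every shingle that begins with that word.
     */
    static List<List<String>> defaultGroups(String[] words, int maxShingleSize) {
        List<List<String>> groups = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            List<String> group = new ArrayList<>();
            StringBuilder shingle = new StringBuilder();
            for (int size = 1; size <= maxShingleSize && start + size <= words.length; size++) {
                if (size > 1) {
                    shingle.append(' ');
                }
                shingle.append(words[start + size - 1]);
                group.add(shingle.toString());
            }
            groups.add(group);
        }
        return groups;
    }

    /** Requested behaviour: every unigram and shingle in one group (one position). */
    static List<String> singleGroup(String[] words, int maxShingleSize) {
        List<String> all = new ArrayList<>();
        for (List<String> group : defaultGroups(words, maxShingleSize)) {
            all.addAll(group);
        }
        return all;
    }
}
```

For the words abcd, efgh and ijkl, defaultGroups yields the three groups shown in the description, while singleGroup places all six tokens together.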



[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633756#action_12633756 ] 

Steven Rowe commented on LUCENE-1380:
-------------------------------------

A couple of comments on the PositionFilter patch:

# The javadocs should be more explicit, e.g. about the fact that positionIncrement defaults to zero.
# I think there ought to be a constructor that takes in a positionIncrement, perhaps instead of the setter.
# You don't handle the case where the filter is used for more than one document; there should be an else clause that resets firstTokenPositioned to false after this block:
{code:java}
if(null != reusableToken){
  if(firstTokenPositioned){
    reusableToken.setPositionIncrement(positionIncrement);
  }else{
    firstTokenPositioned = true;
  }
}
{code}
# You should provide a standalone test for the PositionFilter, in addition to the ShingleFilterTest tests.
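
A minimal sketch of how the constructor and the reset could look, under one reading of the points above: the else clause is attached to the outer null check, so the flag is cleared when the stream ends. MiniToken and PositionFilterSketch are hypothetical stand-ins written for this illustration; they are not the classes in the patch and do not use Lucene's Token or TokenFilter API.

```java
/** Hypothetical minimal token: only the fields this sketch needs. */
final class MiniToken {
    final String term;
    int positionIncrement = 1;

    MiniToken(String term) {
        this.term = term;
    }
}

/** Sketch of a position-flattening filter; an assumption, not the patch itself. */
final class PositionFilterSketch {
    private final int positionIncrement;
    private boolean firstTokenPositioned = false;

    /** The increment is supplied at construction time rather than through a setter. */
    PositionFilterSketch(int positionIncrement) {
        this.positionIncrement = positionIncrement;
    }

    /** Processes one upstream token; null signals the end of the stream. */
    MiniToken next(MiniToken token) {
        if (token != null) {
            if (firstTokenPositioned) {
                token.positionIncrement = positionIncrement;
            } else {
                // The first token of a document keeps its original increment.
                firstTokenPositioned = true;
            }
        } else {
            // End of stream: reset so the next document starts fresh.
            firstTokenPositioned = false;
        }
        return token;
    }

    /** Runs one document through the filter and returns the resulting increments. */
    int[] filterDocument(String... terms) {
        int[] increments = new int[terms.length];
        for (int i = 0; i < terms.length; i++) {
            increments[i] = next(new MiniToken(terms[i])).positionIncrement;
        }
        next(null); // signal end of stream
        return increments;
    }
}
```

Without the reset, the first token of a second document would also receive the configured increment instead of keeping its own.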

> Patch for ShingleFilter.enablePositions (or PositionFilter)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Mck SembWever
>            Priority: Trivial
>         Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position, that is for _all_ shingles (and unigrams if included) to be treated as synonyms of each other.
> Today the shingles generated are synonyms only of the first term in the shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a synonym for.
> This patch takes the first step in making it possible to make all shingles (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.
