
Posted to dev@lucene.apache.org by "Steven Rowe (JIRA)" <ji...@apache.org> on 2010/11/08 22:28:07 UTC

[jira] Created: (LUCENE-2749) Lexically sorted shingle filter

Lexically sorted shingle filter
-------------------------------

                 Key: LUCENE-2749
                 URL: https://issues.apache.org/jira/browse/LUCENE-2749
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Analysis
    Affects Versions: 3.1, 4.0
            Reporter: Steven Rowe
            Priority: Minor
             Fix For: 3.1, 4.0


Sometimes people want to know if words have co-occurred within a specific window onto the token stream, but don't care what the order is.  A Lucene token filter (LexicallySortedWindowFilter?), perhaps implemented as a ShingleFilter sub-class, could provide this functionality.

This feature would allow for exact term set equality queries (in the case of a full-field-width window).
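
As a rough illustration of the full-field-width case (a sketch only, not part of the original issue text and not an existing Lucene API): if every term of a field is deduplicated, sorted lexically, and joined into a single token, two fields produce the same token exactly when they contain the same set of terms. The class name TermSetKey and the space separator are assumptions made for this example.

```java
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not a Lucene class: builds one order-independent key
// from all the terms of a field (the "full-field-width window" case).
final class TermSetKey {
    // Deduplicate the terms, sort them lexically, and join them with a space.
    static String of(List<String> terms) {
        return String.join(" ", new TreeSet<String>(terms));
    }

    public static void main(String[] args) {
        // Both inputs contain the same set of terms, so both keys are equal.
        System.out.println(of(List.of("quick", "brown", "fox")));
        System.out.println(of(List.of("fox", "quick", "brown", "fox")));
    }
}
```

Indexing such a key as a single term would let an exact term query on it act as a term set equality test, at the cost of losing order and term frequency.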


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2749) Co-occurrence filter

Posted by "Elmar Pitschke (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007837#comment-13007837 ] 

Elmar Pitschke commented on LUCENE-2749:
----------------------------------------

The first use case that comes to mind is filtering for possible names. One of the requests I always get is the automatic generation of tag clouds that take the search results into account. I think this would be one way to get names without having to maintain a word list.
Another use, of course, would be to get some kind of semantic combination of words, which could lead to a more "natural" search experience. I think that if a user searches for two words and they occur quite near each other in a text, that may be more useful than many occurrences of the two words with no connection between them.
Which use cases do you have in mind?

> Co-occurrence filter
> --------------------
>
>                 Key: LUCENE-2749
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2749
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 4.0
>
>
> The co-occurrence filter to be developed here will output sets of tokens that co-occur within a given window onto a token stream.  
> These token sets can be ordered either lexically (to allow order-independent matching/counting) or positionally (e.g. sliding windows of positionally ordered co-occurring terms that include all terms in the window are called n-grams or shingles). 
> The parameters to this filter will be: 
> * window size: this can be a fixed sequence length, sentence/paragraph context (these will require sentence/paragraph segmentation, which is not in Lucene yet), or over the entire token stream (full field width)
> * minimum number of co-occurring terms: >= 2
> * maximum number of co-occurring terms: <= window size
> * token set ordering (lexical or positional)
> One use case for co-occurring token sets is as candidates for collocations.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (LUCENE-2749) Co-occurrence filter

Posted by "Elmar Pitschke (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006222#comment-13006222 ] 

Elmar Pitschke commented on LUCENE-2749:
----------------------------------------

Hi,
I am fairly new to Lucene development, but I have plenty of experience using it :). I would like to contribute, and I think this would be a good task to start with, as I am quite interested in the analysis part. Can I work on this task, or has any work been done on it yet?
Regards
   Elmar

[jira] Commented: (LUCENE-2749) Co-occurrence filter

Posted by "Elmar Pitschke (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006812#comment-13006812 ] 

Elmar Pitschke commented on LUCENE-2749:
----------------------------------------

Hi Steven,
thanks for the info, I will work through it and get back here with some questions.
As I have a lot to do with Lucene at my work, this filter would definitely be something that I could use. So the work would not be lost ;)
Regards
   Elmar

[jira] Commented: (LUCENE-2749) Co-occurrence filter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007229#comment-13007229 ] 

Steven Rowe commented on LUCENE-2749:
-------------------------------------

bq. this filter would definitely something that i could use

What use case(s) are you thinking of?

[jira] Updated: (LUCENE-2749) Co-occurrence filter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2749:
--------------------------------

    Description: 
The co-occurrence filter to be developed here will output sets of tokens that co-occur within a given window onto a token stream.  

These token sets can be ordered either lexically (to allow order-independent matching/counting) or positionally (e.g. sliding windows of positionally ordered co-occurring terms that include all terms in the window are called n-grams or shingles). 

The parameters to this filter will be: 

* window size: this can be a fixed sequence length, sentence/paragraph context (these will require sentence/paragraph segmentation, which is not in Lucene yet), or over the entire token stream (full field width)
* minimum number of co-occurring terms: >= 2
* maximum number of co-occurring terms: <= window size
* token set ordering (lexical or positional)

One use case for co-occurring token sets is as candidates for collocations.
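
The parameters above can be sketched for the fixed-length window case as follows. This is a minimal illustration under stated assumptions, not the filter proposed in this issue: it works on a plain list of strings rather than a Lucene TokenStream, and the class name CoOccurrenceSketch, the method tokenSets, and the space separator are all hypothetical. Each set is anchored at the first token of its window so that no combination of positions is emitted twice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not a Lucene TokenFilter: enumerates co-occurring
// token sets over a fixed-length sliding window.
final class CoOccurrenceSketch {
    // windowSize: number of consecutive tokens in a window
    // minTerms/maxTerms: bounds on the size of each emitted token set
    // lexical: true sorts each set lexically, false keeps positional order
    static List<String> tokenSets(List<String> tokens, int windowSize,
                                  int minTerms, int maxTerms, boolean lexical) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            int end = Math.min(start + windowSize, tokens.size());
            // Every set built here contains the token at 'start'.
            List<String> current = new ArrayList<>();
            current.add(tokens.get(start));
            extend(tokens, start + 1, end, current, minTerms, maxTerms, lexical, out);
        }
        return out;
    }

    // Recursively adds later tokens from the same window to the current set.
    private static void extend(List<String> tokens, int from, int end,
                               List<String> current, int minTerms, int maxTerms,
                               boolean lexical, List<String> out) {
        if (current.size() >= minTerms) {
            List<String> set = new ArrayList<>(current);
            if (lexical) {
                Collections.sort(set);
            }
            out.add(String.join(" ", set));
        }
        if (current.size() == maxTerms) {
            return;
        }
        for (int i = from; i < end; i++) {
            current.add(tokens.get(i));
            extend(tokens, i + 1, end, current, minTerms, maxTerms, lexical, out);
            current.remove(current.size() - 1);
        }
    }
}
```

With the tokens "c a b", a window of 2, and both bounds set to 2, positional ordering yields "c a" and "a b" (ordinary two-token shingles), while lexical ordering yields "a c" and "a b".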

  was:
Sometimes people want to know if words have co-occurred within a specific window onto the token stream, but don't care what the order is.  A Lucene token filter (LexicallySortedWindowFilter?), perhaps implemented as a ShingleFilter sub-class, could provide this functionality.

This feature would allow for exact term set equality queries (in the case of a full-field-width window).


        Summary: Co-occurrence filter  (was: Lexically sorted shingle filter)

[jira] Commented: (LUCENE-2749) Co-occurrence filter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006254#comment-13006254 ] 

Steven Rowe commented on LUCENE-2749:
-------------------------------------

Hi Elmar,

I haven't had a chance to do more than an hour or two of work on this, and that was a while back, so please feel free to run with it.

You should know, though, that Robert Muir and Yonik Seeley (both Lucene/Solr developers) expressed skepticism (on #lucene IRC) about whether this filter belongs in Lucene itself, because in their experience, collocations are used by non-search software, and they believe that Lucene should remain focused exclusively on search.  

Robert Muir also thinks that components that support Boolean search (i.e., not ranked search) should go elsewhere.  

I personally disagree with these restrictions in general, and I think that a co-occurrence filter could directly support search.  See this solr-user@lucene.apache.org mailing list discussion for an example I gave (and one of the reasons I made this issue): http://www.lucidimagination.com/search/document/f69f877e0fa05d17/how_do_i_this_in_solr#d9d5932e7074d356 . In this thread, I described a way to solve the original poster's problem using a co-occurrence filter exactly like the one proposed here.

I mention all this to caution you that work you put in here may never be committed to Lucene itself.

The mailing list thread I mentioned above describes the main limitation a filter like this will have: the combinatorial explosion of generated terms.  I haven't figured out how to manage this, but it occurs to me that the two-term-collocation case is less problematic in this regard than the generalized case (whole-field window, all possible combinations).  I had a vague implementation conception of incrementing a fixed-width integer to iterate over the combinations, using the integer's bits to include/exclude input terms in the output "termset" tokens.  Using a 32-bit integer to track combinations would limit the length of an input token stream to 32 tokens, but in the generalized case of all combinations, I'm pretty sure that the number of bits available would not be the limiting factor, but rather the number of generated terms.  I guess the question is how to handle cases that produce fewer terms than all combinations of terms from an input token stream, e.g. the two-term-collocation case, without imposing the restrictions necessary in the generalized case.
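
The integer-incrementing idea described above could look roughly like the following. This is a sketch under assumptions, not code from this issue: the class name BitmaskTermSets, the method termSets, and the space separator are hypothetical, and because Java's int is signed the sketch caps the input at 30 terms rather than 32 to keep the loop bound simple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical sketch: enumerates term combinations by counting an int
// upward and reading its bits as include/exclude flags for the input terms.
final class BitmaskTermSets {
    static List<String> termSets(List<String> terms, int minTerms) {
        int n = terms.size();
        if (n > 30) {
            // Assumption of this sketch: keep (1 << n) a positive int.
            throw new IllegalArgumentException("too many input terms: " + n);
        }
        List<String> out = new ArrayList<>();
        for (int mask = 1; mask < (1 << n); mask++) {
            // Skip combinations with fewer terms than the requested minimum.
            if (Integer.bitCount(mask) < minTerms) {
                continue;
            }
            StringJoiner termSet = new StringJoiner(" ");
            for (int bit = 0; bit < n; bit++) {
                // Bit i set means input term i is part of this termset.
                if ((mask & (1 << bit)) != 0) {
                    termSet.add(terms.get(bit));
                }
            }
            out.add(termSet.toString());
        }
        return out;
    }
}
```

For n input terms and a minimum of two terms per set, this emits 2^n - n - 1 termsets: 4 for three terms, but 1,048,555 for twenty. The two-term case alone produces only n(n-1)/2 sets (190 for twenty terms), which is why it is much less of a problem.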

Here are a couple of recent information retrieval papers using "termset" to mean "indexed token containing multiple input terms":

"TSS: Efficient Term Set Search in Large Peer-to-Peer Textual Collections"
http://www.cs.ust.hk/~liu/TSS-TC.pdf

"Termset-based Indexing and Query Processing in P2P Search"
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5384831

(Sorry, I couldn't find a free public location for the second paper.)

[jira] Commented: (LUCENE-2749) Co-occurrence filter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008353#comment-13008353 ] 

Steven Rowe commented on LUCENE-2749:
-------------------------------------

bq. Which use cases do you have in mind? 

So far just the solution I proposed in the email thread mentioned in [my previous comment|https://issues.apache.org/jira/browse/LUCENE-2749?focusedCommentId=13006254&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13006254] and the P2P distributed search use case described in the two papers mentioned in the same comment.

[jira] Updated: (LUCENE-2749) Co-occurrence filter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2749:
--------------------------------

    Fix Version/s:     (was: 3.1)
