Posted to dev@lucene.apache.org by "Stanislaw Osinski (Created) (JIRA)" <ji...@apache.org> on 2011/11/25 09:36:39 UTC

[jira] [Created] (SOLR-2917) Support for field-specific tokenizers, token- and character filters in search results clustering

Support for field-specific tokenizers, token- and character filters in search results clustering
------------------------------------------------------------------------------------------------

                 Key: SOLR-2917
                 URL: https://issues.apache.org/jira/browse/SOLR-2917
             Project: Solr
          Issue Type: Improvement
          Components: contrib - Clustering
            Reporter: Stanislaw Osinski
            Assignee: Stanislaw Osinski
             Fix For: 3.6


Currently, the Carrot2 search results clustering component creates clusters based on the raw text of a field. The reason is that Carrot2 aims to create meaningful cluster labels by using sequences of words taken directly from the documents' text (including stop words: _Development of Lucene and Solr_ is more readable than _Development Lucene Solr_). The easiest way to provide input for such a process was to feed Carrot2 the raw (stored) document content.

It is, however, possible to take into account +some+ of the field's filters during clustering. Because Carrot2 does not currently expose an API for feeding pre-tokenized input, the clustering component would need to: 

1. get the raw text of the field, 
2. run it through the field's char filters, tokenizer and selected token filters (omitting e.g. the stop word filter and stemmers; Carrot2 needs the original words to produce readable cluster labels), 
3. glue the output back into a string and feed it to Carrot2 for clustering. 

In the future, to eliminate step 3, we could modify Carrot2 to accept pre-tokenized content.
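The three steps above can be sketched, independently of Lucene's actual TokenStream API, as a small stdlib-only Java program. The char filter, tokenizer, and filter choices here are simplified stand-ins for a field's real analysis chain, not the actual Lucene factories:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a field's analysis chain: one char filter,
// one tokenizer, and a subset of the token filters (no stop word removal,
// no stemming), followed by gluing the tokens back into a string.
public class ClusteringInputSketch {

    static String glueForClustering(String rawFieldText) {
        // Step 1 + 2a: raw text through a char filter (here: strip markup-like tags).
        String filtered = rawFieldText.replaceAll("<[^>]*>", " ");

        // Step 2b: tokenize on non-letter characters.
        List<String> tokens = new ArrayList<>();
        for (String t : filtered.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase()); // selected token filter: lowercasing only
            }
        }
        // Deliberately NOT applied: stop word removal and stemming, so the
        // clustering engine still sees the original words for readable labels.

        // Step 3: glue the output back into a string for Carrot2.
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // prints: development of lucene and solr
        System.out.println(glueForClustering("<b>Development</b> of Lucene and Solr"));
    }
}
```

Note that stop words survive the chain, which is exactly what distinguishes this pipeline from a regular indexing analyzer.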



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (SOLR-2917) Support for field-specific tokenizers, token- and character filters in search results clustering

Posted by "Uwe Schindler (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157047#comment-13157047 ] 

Uwe Schindler commented on SOLR-2917:
-------------------------------------

bq. On the other hand, the schema could define a parallel field with certain filters disabled, clustering should work nicely with such a stream.

That was the idea behind the suggestion. The Highlighter works a little bit differently, so it does not need this: it uses the TermVectors only for finding the highlighting offsets, but marks the highlights in the original text (from a stored field). It just saves reanalyzing the text, which can be expensive if you use e.g. BASIS or other heavy analysis.
                

[jira] [Commented] (SOLR-2917) Support for field-specific tokenizers, token- and character filters in search results clustering

Posted by "Stanislaw Osinski (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157045#comment-13157045 ] 

Stanislaw Osinski commented on SOLR-2917:
-----------------------------------------

Would a typical TVTokenStream contain stop words, original (unstemmed) forms and sentence separators? If not, the human-readability of cluster labels would suffer quite a bit. On the other hand, the schema could define a parallel field with certain filters disabled; clustering should work nicely with such a stream. Is there any other solution to this?
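The parallel-field idea could be sketched in schema.xml roughly like this (the field and type names are made up, and the analysis chain is a minimal example; the point is simply that the clustering copy omits the stop word filter and stemmer):

```xml
<!-- Parallel field type for clustering: same tokenizer as the main field,
     but no stop word filter and no stemmer, so original words survive -->
<fieldType name="text_cluster" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="body_cluster" type="text_cluster" indexed="true" stored="true"/>

<!-- Keep the clustering copy in sync with the main body field -->
<copyField source="body" dest="body_cluster"/>
```

The cost of this setup is the one discussed below: the content is effectively stored twice in exchange for skipping re-analysis at query time.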
                

[jira] [Commented] (SOLR-2917) Support for field-specific tokenizers, token- and character filters in search results clustering

Posted by "Uwe Schindler (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157040#comment-13157040 ] 

Uwe Schindler commented on SOLR-2917:
-------------------------------------

By eliminating step 3, could Carrot2 also be fed from term vectors, via the Highlighter's TVTokenStream?
                

[jira] [Commented] (SOLR-2917) Support for field-specific tokenizers, token- and character filters in search results clustering

Posted by "Dawid Weiss (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157046#comment-13157046 ] 

Dawid Weiss commented on SOLR-2917:
-----------------------------------

Step 3 is necessary because we have a different tokenization pipeline in C2... but eliminating it would be a step toward a more compact integration, for sure.
                

[jira] [Commented] (SOLR-2917) Support for field-specific tokenizers, token- and character filters in search results clustering

Posted by "Stanislaw Osinski (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157054#comment-13157054 ] 

Stanislaw Osinski commented on SOLR-2917:
-----------------------------------------

bq. That was the idea behind the suggestion. The Highlighter works a little bit differently, so it does not need this: it uses the TermVectors only for finding the highlighting offsets, but marks the highlights in the original text (from a stored field). It just saves reanalyzing the text, which can be expensive if you use e.g. BASIS or other heavy analysis.

Yeah, it's a bit different indeed, because clustering would need the original text of the tokens rather than just the start offset and length. Ultimately, the choice between storing two different token streams and doing the analysis at runtime is a trade-off between storage size (doubled?) and runtime performance. Once we get Carrot2 to support pre-tokenized input (not hard conceptually, but tricky in terms of the API), both solutions would be possible.
                