Posted to solr-dev@lucene.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2010/01/02 18:11:54 UTC

[jira] Created: (SOLR-1692) CarrotClusteringEngine produce summary does nothing

CarrotClusteringEngine produce summary does nothing
---------------------------------------------------

                 Key: SOLR-1692
                 URL: https://issues.apache.org/jira/browse/SOLR-1692
             Project: Solr
          Issue Type: Bug
          Components: contrib - Clustering
            Reporter: Grant Ingersoll
            Assignee: Grant Ingersoll
             Fix For: 1.5


In the CarrotClusteringEngine, the produceSummary option does nothing, as the results of doing the highlighting are just ignored.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1692) CarrotClusteringEngine produce summary does nothing

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795925#action_12795925 ] 

Stanislaw Osinski commented on SOLR-1692:
-----------------------------------------

{quote}
bq. Where should the configuration of the highlighter we use for clustering come from?

We have all the code hooked in for it already; we're just ignoring the output.
{quote}

To avoid confusion and questions along the lines of "why don't the clusters match the (highlighted) documents I'm seeing?", I'd suggest a slightly more elaborate scheme for configuring the clustering highlighter:

1. If main Solr highlighting is disabled, use the clustering component's highlighter settings.
2. If main Solr highlighting is enabled, use the main highlighter's configuration as the defaults and let the clustering-specific highlighter configuration override the defaults.

If we do it this way, we'll minimize the chance of users accidentally clustering text that is highlighted differently from the text they will actually see.
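
The two rules above could be sketched roughly as follows. This is only an illustration, not code from the patch: the class `ClusteringHighlightConfig`, the method `resolve`, and the use of plain string maps for the settings are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Solr or Carrot2. */
final class ClusteringHighlightConfig {
    private ClusteringHighlightConfig() {}

    /**
     * Resolves the highlighter settings to use when producing text for clustering.
     *
     * @param mainHighlightingEnabled whether main Solr highlighting is enabled
     * @param mainParams              settings of the main highlighter
     * @param clusteringParams        clustering-specific highlighter settings
     * @return a new map holding the effective settings
     */
    static Map<String, String> resolve(boolean mainHighlightingEnabled,
                                       Map<String, String> mainParams,
                                       Map<String, String> clusteringParams) {
        Map<String, String> effective = new HashMap<>();
        if (mainHighlightingEnabled) {
            // Rule 2: the main highlighter's configuration supplies the defaults.
            effective.putAll(mainParams);
        }
        // Rules 1 and 2: clustering-specific settings always apply and
        // override any default with the same key.
        effective.putAll(clusteringParams);
        return effective;
    }
}
```

With main highlighting enabled, a key present in both maps takes the clustering-specific value, while keys set only on the main highlighter are inherited. With main highlighting disabled, only the clustering-specific settings survive.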

bq. It would be great if Carrot2 could also just use the analysis that Lucene/Solr produces; that way it would be much easier to configure stopwords, HTML stripping, etc.

This one would require some larger changes to Carrot2 internals. We do use Lucene infrastructure for preprocessing (currently for tokenization), but I can investigate whether we can extend that further. A potential problem here is that the set of stopwords you use for document retrieval very often does not work equally well for clustering. I've filed a [Carrot2-specific issue|http://issues.carrot2.org/browse/CARROT-606] for it and will try to come up with something.

[jira] Updated: (SOLR-1692) CarrotClusteringEngine produce summary does nothing

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-1692:
----------------------------------

    Attachment: SOLR-1692.patch

Fixes the bug and adds a new parameter to specify the fragment size used by the highlighter.

[jira] Commented: (SOLR-1692) CarrotClusteringEngine produce summary does nothing

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795914#action_12795914 ] 

Stanislaw Osinski commented on SOLR-1692:
-----------------------------------------

I've had a quick look into this issue and have two questions to consider:

* Where should the configuration of the highlighter we use for clustering come from? Should it be the same as for the regular Solr highlighting, or should we allow a clustering-specific configuration? My intuition is that we should go with the former. Otherwise, we may lose the clear relationship between cluster labels and documents in the output, because the clusters will be generated from text that differs from what the user is going to see.

* What should we do if the highlighter is not able to generate a summary? One option is to use the full contents of the field. Alternatively, we could use the first N (configurable) characters of the field. The answer really depends on the characteristics of the data we may get. If the total number of documents fed to Carrot2 doesn't exceed about 1000, longer documents shouldn't be too much of a problem, so I'd suggest the former option (use the full field text).
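
The two fallback options from the second question could look something like the sketch below. It is an illustration only: the class `SummaryFallback`, the method `textForClustering`, and the convention that a negative `maxChars` means "use the full field" are assumptions, not existing options of the clustering component.

```java
/** Hypothetical helper for illustration; not part of Solr or Carrot2. */
final class SummaryFallback {
    private SummaryFallback() {}

    /**
     * Chooses the text to hand to the clustering algorithm for one document.
     *
     * @param summary   the highlighter's summary, or null/empty if none was produced
     * @param fullText  the full contents of the field
     * @param maxChars  maximum number of characters to take from the full text when
     *                  falling back; a negative value means "no limit" (assumed convention)
     */
    static String textForClustering(String summary, String fullText, int maxChars) {
        if (summary != null && !summary.isEmpty()) {
            // The highlighter produced something usable, so prefer it.
            return summary;
        }
        if (fullText == null) {
            return "";
        }
        if (maxChars < 0 || fullText.length() <= maxChars) {
            // Option 1: fall back to the full contents of the field.
            return fullText;
        }
        // Option 2: fall back to the first N characters. Note that this counts
        // UTF-16 code units and may cut a word (or a surrogate pair) in half.
        return fullText.substring(0, maxChars);
    }
}
```

Passing a negative limit gives the "full field text" behaviour suggested above; a non-negative limit gives the "first N characters" alternative.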

[jira] Commented: (SOLR-1692) CarrotClusteringEngine produce summary does nothing

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795873#action_12795873 ] 

Grant Ingersoll commented on SOLR-1692:
---------------------------------------

The relevant lines are:
{code}
String snippet = getValue(doc, snippetField);
if (produceSummary == true) {
  docsHolder[0] = id.intValue();
  DocList docAsList = new DocSlice(0, 1, docsHolder, scores, 1, 1.0f);
  highligher.doHighlighting(docAsList, theQuery, req, snippetFieldAry);
}
{code}

It seems like we do the highlighting but then don't use the result. If I recall correctly, the result should then be used to set the snippet value.
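
A self-contained sketch of the intended behaviour (capture the highlighter's output and use it as the snippet) might look like this. It deliberately does not use Solr's classes: the `Highlighter` interface, its return shape (fragments keyed by field name), and the choice to join fragments with a space are all assumptions made for illustration, not the contents of the attached patch.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not the code from SOLR-1692.patch. */
final class SnippetSelection {
    private SnippetSelection() {}

    /** Stand-in for the component that performs highlighting; not Solr's API. */
    interface Highlighter {
        /** Returns fragments keyed by field name, or null if nothing was produced. */
        Map<String, List<String>> highlight(int docId, String[] fields);
    }

    /**
     * Returns the text to cluster for one document: the highlighted fragments when
     * produceSummary is on and the highlighter yields any, otherwise the stored value.
     */
    static String snippetFor(int docId, String fieldName, String storedValue,
                             boolean produceSummary, Highlighter highlighter) {
        String snippet = storedValue;
        if (produceSummary) {
            // Keep the return value instead of discarding it.
            Map<String, List<String>> highlights =
                highlighter.highlight(docId, new String[] {fieldName});
            List<String> fragments = (highlights == null) ? null : highlights.get(fieldName);
            if (fragments != null && !fragments.isEmpty()) {
                // Use the highlighting result to set the snippet value.
                snippet = String.join(" ", fragments);
            }
        }
        return snippet;
    }
}
```

When the highlighter returns nothing for the field, the stored field value is kept, which matches the behaviour without produceSummary.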

[jira] Commented: (SOLR-1692) CarrotClusteringEngine produce summary does nothing

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795919#action_12795919 ] 

Grant Ingersoll commented on SOLR-1692:
---------------------------------------

bq. Where should the configuration of the highlighter we use for clustering come from?

We have all the code hooked in for it already; we're just ignoring the output.

bq. What should we do if the highlighter is not able to generate a summary?

I think we can default to the full contents, which is what would be used if you don't specify produceSummary. We can handle the first-N-characters option separately, I suppose.

It would be great if Carrot2 could also just use the analysis that Lucene/Solr produces; that way it would be much easier to configure stopwords, HTML stripping, etc.
