Posted to solr-dev@lucene.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2009/02/10 15:45:05 UTC
[jira] Updated: (SOLR-769) Support Document and Search Result clustering

     [ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-769:
---------------------------------

    Attachment: SOLR-769.patch

Here's a patch for Carrot2 3.0 that COMPILES ONLY.  
You will need to download clustering-libs.tar.gz from http://people.apache.org/~gsingers/clustering-libs.tar.gz, as it is too big to upload to JIRA.

TODO:
1. Get the tests passing and add more tests
2. Update NOTICE.txt and LICENSE.txt
3. Get a trimmed-down Carrot2 library that doesn't have all the Document Source dependencies, and preferably not the web services deps either.  Solr doesn't need the Google, etc. API deps.  Preferably remove the LGPL deps too, but for now, they are downloaded via ANT from the Maven repositories.
4. Update the Maven template
5. Hook in the builds
6. Make sure the example works

> Support Document and Search Result clustering
> ---------------------------------------------
>
>                 Key: SOLR-769
>                 URL: https://issues.apache.org/jira/browse/SOLR-769
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: clustering-libs.tar, clustering-libs.tar, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch
>
>
> Clustering is a useful tool for working with documents and search results, similar to the notion of dynamic faceting.  Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed library for doing search results clustering.  Mahout (http://lucene.apache.org/mahout) is well suited for whole-corpus clustering.
> The patch lays out a contrib module that starts off w/ an integration of a SearchComponent for doing clustering and an implementation using Carrot2.  In search results mode, it will use the DocList as the input for clustering.  While Carrot2 comes w/ a Solr input component, it is not the same as the SearchComponent that I have: the Carrot2 example actually submits a query to Solr, whereas my SearchComponent is just chained into the Component list and uses the ResponseBuilder to add in the cluster results.
> While not fully fleshed out yet, the collection-based mode will take in a list of ids or just use the whole collection and will produce clusters.  Since this is a longer, typically offline task, there will need to be some type of storage mechanism (and replication?) for the clusters.  I _may_ push this off to a separate JIRA issue, but I at least want to present the use case as part of the design of this component/contrib.  It may even make sense that we split this out, such that the building piece is something like an UpdateProcessor and then the SearchComponent just acts as a lookup mechanism.
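
For readers unfamiliar with the chaining idea in the description above, here is a rough sketch in plain Java.  It only illustrates the design point that the clustering step reuses the document list already produced earlier in the chain instead of submitting its own query.  Every name in it (ComponentChainSketch, SketchResponseBuilder, SketchComponent, QueryStep, ClusteringStep, handle) is a hypothetical stand-in, not one of Solr's or Carrot2's actual classes, and the group-by-first-word logic is just a placeholder for a real clustering algorithm.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: hypothetical stand-ins, NOT Solr's or Carrot2's real classes.
public class ComponentChainSketch {

    /** Stand-in for the shared per-request state that every component sees. */
    static class SketchResponseBuilder {
        final String query;
        final List<String> docList = new ArrayList<>();            // filled by the query step
        final Map<String, Object> response = new LinkedHashMap<>();

        SketchResponseBuilder(String query) {
            this.query = query;
        }
    }

    /** Stand-in for a component that is chained into the request handling. */
    interface SketchComponent {
        void process(SketchResponseBuilder rb);
    }

    /** Runs the query once and records the matching documents. */
    static class QueryStep implements SketchComponent {
        private final List<String> index;

        QueryStep(List<String> index) {
            this.index = index;
        }

        @Override
        public void process(SketchResponseBuilder rb) {
            for (String doc : index) {
                if (doc.toLowerCase().contains(rb.query.toLowerCase())) {
                    rb.docList.add(doc);
                }
            }
            rb.response.put("response", rb.docList);
        }
    }

    /** Reuses rb.docList; it never issues a query of its own. */
    static class ClusteringStep implements SketchComponent {
        @Override
        public void process(SketchResponseBuilder rb) {
            // Placeholder "clustering": group documents by their first word.
            Map<String, List<String>> clusters = new LinkedHashMap<>();
            for (String doc : rb.docList) {
                String label = doc.split("\\s+")[0].toLowerCase();
                clusters.computeIfAbsent(label, k -> new ArrayList<>()).add(doc);
            }
            rb.response.put("clusters", clusters);
        }
    }

    /** Runs the chain in order; the clustering step comes after the query step. */
    public static Map<String, Object> handle(String query, List<String> index) {
        SketchResponseBuilder rb = new SketchResponseBuilder(query);
        List<SketchComponent> chain = List.of(new QueryStep(index), new ClusteringStep());
        for (SketchComponent component : chain) {
            component.process(rb);
        }
        return rb.response;
    }

    public static void main(String[] args) {
        List<String> index = List.of(
                "Solr clustering patch",
                "Solr faceting guide",
                "Mahout clustering notes",
                "Lucene scoring");
        System.out.println(handle("clustering", index));
    }
}
```

Because both steps share one request-state object, the cluster results end up in the same response as the normal search results, and the index is queried only once per request.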
