Posted to solr-user@lucene.apache.org by Mark Bennett <mb...@ideaeng.com> on 2009/08/11 18:44:50 UTC

Solr 1.4 Clustering / mlt AS search?

I'm going somewhere with this... be patient.  :-)  I had asked about this
briefly at the SF meetup, but there was a lot going on.

1: Suppose you had Solr 1.4 and all the Carrot^2 DOCUMENT clustering was all
in, and you had built the cluster index for all your docs.

2: Then, if you had a particular cluster, and one of the docs in that
cluster happened to be your search, then the other documents in the cluster
could be considered the results.  In effect, the cluster is like the search
results.

3: Now imagine you can take an arbitrary doc and find the clusters that
document is in.  (some clustering engines let you do this).

4: And then imagine that, when somebody submits a search, you quickly turn
it into a document, add it to the index, redo the clusters, find the
clusters this new temp doc is in, and use that as the results.

Benefits?

I'm not saying this would be practical, but would it be useful?  Or, in
particular, would it be more useful than the normal Solr/Lucene relevancy?
As I recall Carrot^2 had 3 choices for clustering.

And let's assume that the searches coming in are more than the 1.4 words
average.  Maybe a few sentences or something.  I'm not sure a 1 word query
would really benefit from this.  :-)

Some clustering algorithms don't allow you to find a cluster containing a
specific document, so those wouldn't work as a "search engine".

More Like This as a "cluster" search?

A similar scenario could be made for the "more like this" feature.  Take a
user's search text (presumably lengthy), quickly index it, then use that new
temp doc as an MLT seed doc.  I haven't looked deeply into the code; it might
be that it uses essentially the same relevancy as a query.

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

Re: Solr 1.4 Clustering / mlt AS search?

Posted by Mark Bennett <mb...@ideaeng.com>.
With regard to my second question, about More Like This, I do see:
"The MoreLikeThisHandler can also use a ContentStream to find similar
documents. It will extract the "interesting terms" from the posted text."
at http://wiki.apache.org/solr/MoreLikeThisHandler
and that it uses the TF/IDF stuff.
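
A minimal sketch of what such a ContentStream request might look like, for
anyone who wants to try it. It only builds the request URL. The host, the
/mlt handler path, and the field name "text" are assumptions about the local
setup, not something from the wiki page; the mlt.* parameters and stream.body
are standard Solr request parameters, though depending on the Solr version
and configuration stream.body may have to be enabled first.

```python
from urllib.parse import urlencode


def build_mlt_url(query_text, base_url="http://localhost:8983/solr",
                  fields=("text",), rows=10):
    """Build a MoreLikeThisHandler URL that sends the query text as a stream.

    Assumes the handler is registered at /mlt (hypothetical local config).
    """
    params = {
        "stream.body": query_text,        # raw text to find similar docs for
        "mlt.fl": ",".join(fields),       # fields used to pick interesting terms
        "mlt.interestingTerms": "list",   # ask Solr to report the terms it chose
        "mlt.mintf": 1,                   # short text: accept terms seen only once
        "mlt.mindf": 1,                   # accept terms that occur in a single doc
        "rows": rows,
    }
    return base_url + "/mlt?" + urlencode(params)


if __name__ == "__main__":
    print(build_mlt_url("how do I reset the admin password after an upgrade"))
```

Lowering mlt.mintf and mlt.mindf is a guess at what a sentence-length query
needs; the defaults are tuned for whole documents.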

Still wondering if anybody's tried MLT or Carrot clustering as a primary
search entry point.

Re: Solr 1.4 Clustering / mlt AS search?

Posted by Stanislaw Osinski <st...@osinski.name>.
Hi,

On Thu, Aug 13, 2009 at 19:29, Mark Bennett <mb...@ideaeng.com> wrote:

> There are comments in the Solr materials about having an option to cluster
> based on the entire document set, and some warning about this being atypical
> and possibly slow.  And from what you're saying, for a big enough docset, it
> might go from "slow" to "impossible", I'm not sure.


For Carrot2, it would go to "impossible" I'd say. But as Grant mentioned
earlier, Mahout is developing clustering algorithms that should be able to
handle the whole-index types of docsets.

> And so my question was, *if* you were willing to spend that much time and
> effort to cluster all the text of all the documents (and if it were even
> possible), would the result perform better than the standard TF/IDF
> techniques?


Depends on the algorithm, really. In case of Carrot2, we don't do re-ranking
of documents within clusters, we simply use whatever document order we got
on input. As far as I'm aware, most clustering algorithms do pretty much the
same: they concentrate on finding groups of documents and don't delve much
into the issues of ranking documents within clusters.


> In the application I'm considering, the queries tend to be longer than
> average, more like full sentences or more.  And they tend to be of a
> question and answer nature.  I've seen references in several search engines
> that QandA search sometimes benefits from alternative search techniques.
> And, from a separate email, the IDF part of the standard similarity may be
> causing a problem, so I'm casting a wide net for other ideas.  Just
> brainstorming here... :-)


Because of what I described above, clustering the whole index may not give
you the best results. But you can try something different: fetch a batch
(100-500) of more or less relevant documents for the question (MLT should be
fine to start with), add your question as an extra document, perform
clustering, and see where the question-document ends up. If it doesn't end up
in the Other Topics cluster, you could check whether the other documents from
that cluster give an answer to the question. In this
scenario, Carrot2 should be fine, at least performance-wise. I've not
followed the QA literature very closely, so it's hard to say what the
results would be quality-wise, but it should be very quick to try. Carrot2
Clustering Workbench [1][2] may come in handy for the experiments too.

S.

[1] http://download.carrot2.org/head/manual/#section.workbench
[2] http://download.carrot2.org/head/manual/#section.getting-started.xml-files
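
The experiment described above (fetch candidates, add the question as one
more document, cluster, see where the question lands) can be sketched in a
few lines. This is NOT Carrot2 and not one of its algorithms: it swaps in a
toy single-pass "leader" clustering over bag-of-words cosine similarity, just
to show the shape of the idea. All function names and the 0.3 threshold are
made up for illustration. A question that ends up alone in its cluster plays
the role of landing in Other Topics.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Lowercased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def leader_cluster(texts, threshold=0.3):
    """Greedy clustering: each text joins the first cluster whose leader
    (its first member) is similar enough, otherwise it starts a new one."""
    vectors = [vectorize(t) for t in texts]
    clusters = []  # lists of indexes into texts; entry 0 is the leader
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def neighbours_of_question(question, candidates, threshold=0.3):
    """Cluster candidates plus the question; return the question's cluster
    mates, or an empty list if the question ended up on its own."""
    texts = list(candidates) + [question]
    q_index = len(texts) - 1
    for cluster in leader_cluster(texts, threshold):
        if q_index in cluster:
            return [texts[i] for i in cluster if i != q_index]
    return []


if __name__ == "__main__":
    docs = ["reset the admin password",
            "install solr on tomcat",
            "admin password reset steps"]
    print(neighbours_of_question("how to reset admin password", docs))
```

In a real trial the candidates would come from an MLT or ordinary query
against the index, and the clustering step would be handed to Carrot2 itself.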

Re: Solr 1.4 Clustering / mlt AS search?

Posted by Grant Ingersoll <gs...@apache.org>.
On Aug 13, 2009, at 1:29 PM, Mark Bennett wrote:

> * mlb: comments
>
> On Thu, Aug 13, 2009 at 9:39 AM, Stanislaw Osinski <st...@gmail.com> wrote:
>
>> Hi,
>>
>> On Tue, Aug 11, 2009 at 22:19, Mark Bennett <mb...@ideaeng.com> wrote:
>>
>>> Carrot2 has several pluggable algorithms to choose from, though I have no
>>> evidence that they're "better" than Lucene's.  Where TF/IDF is sort of a
>>> one step algebraic calculation, some clustering algorithms use iterative
>>> approaches, etc.
>>
>>
>> I'm not sure if I completely follow the way in which you'd like to use
>> Carrot2 for scoring -- would you cluster the whole index? Carrot2 was
>> designed to be a post-retrieval clustering algorithm and optimized to
>> cluster small sets of documents (up to ~1000) in real time. All processing
>> is performed in-memory, which limits Carrot2's applicability to really
>> large sets of documents.
>>
>> S.
>>
>
> * mlb: I agree with all of your assertions, but...
>
> There are comments in the Solr materials about having an option to cluster
> based on the entire document set, and some warning about this being atypical
> and possibly slow.  And from what you're saying, for a big enough docset, it
> might go from "slow" to "impossible", I'm not sure.

Those comments refer to an as-yet-unimplemented feature that will allow for
pluggable background clustering, using something like Mahout to cluster the
whole collection and then return the results later upon request.


>
> And so my question was, *if* you were willing to spend that much time and
> effort to cluster all the text of all the documents (and if it were even
> possible), would the result perform better than the standard TF/IDF
> techniques?
>
> In the application I'm considering, the queries tend to be longer than
> average, more like full sentences or more.  And they tend to be of a
> question and answer nature.  I've seen references in several search engines
> that QandA search sometimes benefits from alternative search techniques.
> And, from a separate email, the IDF part of the standard similarity may be
> causing a problem, so I'm casting a wide net for other ideas.  Just
> brainstorming here... :-)

QA has a lot of factors at play. I can't recall anyone using clustering as
a way of doing the initial passage retrieval, but it's been a few years
since I kept up with that literature.

You can of course turn off or downplay IDF if that is an issue.  I think
payloads can also play a useful role in QA (or Lucene's new Attribute
capabilities, but I won't quite go there yet) because you could store
term-level information (part-of-speech tags often help QA, as does parsing
information).
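
To make "turn off or downplay IDF" concrete, here is a small sketch of the
effect on ranking. It is not Lucene code. It borrows only the tf and idf
shapes of Lucene's default similarity (square root of the term frequency, and
1 + ln(numDocs / (docFreq + 1))) and leaves out normalization, boosts, and
everything else. The idf_weight knob is invented for illustration; in Lucene
itself the equivalent change would be made in a custom Similarity class.

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(query_terms, doc_terms, doc_freqs, num_docs, idf_weight=1.0):
    """Sum of sqrt(tf) * idf**idf_weight over the query terms found in the doc.

    idf_weight=1.0 is plain tf*idf, 0.0 switches IDF off entirely, and
    values in between (hypothetical knob) merely downplay it.
    """
    total = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf:
            total += math.sqrt(tf) * idf(doc_freqs[term], num_docs) ** idf_weight
    return total


if __name__ == "__main__":
    doc_freqs = {"the": 99, "payload": 1}   # made-up corpus statistics
    query = ["the", "payload"]
    common = ["the", "the", "the", "the"]   # doc full of a common term
    rare = ["payload"]                      # doc with one rare term
    for weight in (1.0, 0.5, 0.0):
        print(weight,
              score(query, common, doc_freqs, 100, weight),
              score(query, rare, doc_freqs, 100, weight))
```

With IDF on, the document containing the rare term wins; with it off, raw
term frequency decides, which may or may not suit long question-style
queries.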


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Solr 1.4 Clustering / mlt AS search?

Posted by Mark Bennett <mb...@ideaeng.com>.
* mlb: comments

On Thu, Aug 13, 2009 at 9:39 AM, Stanislaw Osinski <st...@gmail.com> wrote:

> Hi,
>
> On Tue, Aug 11, 2009 at 22:19, Mark Bennett <mb...@ideaeng.com> wrote:
>
> > Carrot2 has several pluggable algorithms to choose from, though I have no
> > evidence that they're "better" than Lucene's.  Where TF/IDF is sort of a
> > one step algebraic calculation, some clustering algorithms use iterative
> > approaches, etc.
>
>
> I'm not sure if I completely follow the way in which you'd like to use
> Carrot2 for scoring -- would you cluster the whole index? Carrot2 was
> designed to be a post-retrieval clustering algorithm and optimized to
> cluster small sets of documents (up to ~1000) in real time. All processing
> is performed in-memory, which limits Carrot2's applicability to really
> large sets of documents.
>
> S.
>

* mlb: I agree with all of your assertions, but...

There are comments in the Solr materials about having an option to cluster
based on the entire document set, and some warning about this being atypical
and possibly slow.  And from what you're saying, for a big enough docset, it
might go from "slow" to "impossible", I'm not sure.

And so my question was, *if* you were willing to spend that much time and
effort to cluster all the text of all the documents (and if it were even
possible), would the result perform better than the standard TF/IDF
techniques?

In the application I'm considering, the queries tend to be longer than
average, more like full sentences or more.  And they tend to be of a
question and answer nature.  I've seen references in several search engines
that QandA search sometimes benefits from alternative search techniques.
And, from a separate email, the IDF part of the standard similarity may be
causing a problem, so I'm casting a wide net for other ideas.  Just
brainstorming here... :-)

So, given that, did you have any thoughts on it, Stanislaw?
Mark

Re: Solr 1.4 Clustering / mlt AS search?

Posted by Stanislaw Osinski <st...@gmail.com>.
Hi,

On Tue, Aug 11, 2009 at 22:19, Mark Bennett <mb...@ideaeng.com> wrote:

> Carrot2 has several pluggable algorithms to choose from, though I have no
> evidence that they're "better" than Lucene's.  Where TF/IDF is sort of a
> one step algebraic calculation, some clustering algorithms use iterative
> approaches, etc.


I'm not sure if I completely follow the way in which you'd like to use
Carrot2 for scoring -- would you cluster the whole index? Carrot2 was
designed to be a post-retrieval clustering algorithm and optimized to
cluster small sets of documents (up to ~1000) in real time. All processing
is performed in-memory, which limits Carrot2's applicability to really large
sets of documents.

S.

Re: Solr 1.4 Clustering / mlt AS search?

Posted by Mark Bennett <mb...@ideaeng.com>.
Thanks Grant.

*** mlb: comments inline

On Tue, Aug 11, 2009 at 12:40 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Inline...
>
> On Aug 11, 2009, at 12:44 PM, Mark Bennett wrote:
>
>> I'm going somewhere with this... be patient.  :-)  I had asked about this
>> briefly at the SF meetup, but there was a lot going on.
>>
>> 1: Suppose you had Solr 1.4 and all the Carrot^2 DOCUMENT clustering was
>> all in, and you had built the cluster index for all your docs.
>>
>> 2: Then, if you had a particular cluster, and one of the docs in that
>> cluster happened to be your search, then the other documents in the
>> cluster could be considered the results.  In effect, the cluster is like
>> the search results.
>>
>> 3: Now imagine you can take an arbitrary doc and find the clusters that
>> document is in.  (some clustering engines let you do this).
>>
>> 4: And then imagine that, when somebody submits a search, you quickly turn
>> it into a document, add it to the index, redo the clusters, find the
>> clusters this new temp doc is in, and use that as the results.
>>
>>
> I guess I'd argue that this is already what Lucene does, except for the
> part about adding the query into the document set.  The Lucene Query is just
> your arbitrary document.  Really, the primary difference as I see it, I
> think, is that you want the Carrot2 scoring mechanism instead of the
> existing Lucene one, no?  Otherwise, I don't see much benefit to actually
> indexing the query, other than it could potentially be used to skew results
> over time as people ask the same queries over and over again.


*** mlb: Yes, this is essentially what I'm suggesting.

Carrot2 has several pluggable algorithms to choose from, though I have no
evidence that they're "better" than Lucene's.  Where TF/IDF is sort of a one
step algebraic calculation, some clustering algorithms use iterative
approaches, etc.


>
>
> Under a certain lens, couldn't you just argue that search is finding all
> the docs that cluster around your query?  (I know that isn't the traditional
> description, but regardless, the math underneath is often very similar)
>

*** mlb: Yes, exactly.  And so the question is whether some of these other
methods might work better for certain applications, certain vocabularies, etc.

So I guess it's about flexibility.  Though you can plug in your own
similarity class, that's still the one-shot algebraic model, regardless of
the specific formulas.  Some of the newer machine learning algorithms have
other tricks up their sleeves that might fit some usage models better.


>
>
>
>> Benefits?
>>
>> I'm not saying this would be practical, but would it be useful?  Or, in
>> particular, would it be more useful than the normal Solr/Lucene relevancy?
>> As I recall Carrot^2 had 3 choices for clustering.
>>
>> And let's assume that the searches coming in are more than the 1.4 words
>> average.  Maybe a few sentences or something.  I'm not sure a 1 word query
>> would really benefit from this.  :-)
>>
>> Some clustering algorithms don't allow you to find a cluster containing a
>> specific document, so those wouldn't work as a "search engine".
>>
>> More Like This as a "cluster" search?
>>
>> A similar scenario could be made for the "more like this" feature.  Take a
>> user's search text (presumably lengthy), quickly index it, then use that
>> new temp doc as a MLT seed doc.  I haven't looked deep into the code, it
>> might be that it uses essentially the same relevancy as a query.
>>
>
> Again, I don't see the benefit of indexing it.  You slightly perturb the
> corpus statistics, but other than that, how is it different from just
> submitting the query and getting back the results?


*** Yeah, actually I'm not wild about changing the index for the sake of
processing a search.  And looking at MLT, they claim you can send in a
stream, so no need to update the index.

Re: Solr 1.4 Clustering / mlt AS search?

Posted by Grant Ingersoll <gs...@apache.org>.
Inline...

On Aug 11, 2009, at 12:44 PM, Mark Bennett wrote:

> I'm going somewhere with this... be patient.  :-)  I had asked about this
> briefly at the SF meetup, but there was a lot going on.
>
> 1: Suppose you had Solr 1.4 and all the Carrot^2 DOCUMENT clustering was
> all in, and you had built the cluster index for all your docs.
>
> 2: Then, if you had a particular cluster, and one of the docs in that
> cluster happened to be your search, then the other documents in the
> cluster could be considered the results.  In effect, the cluster is like
> the search results.
>
> 3: Now imagine you can take an arbitrary doc and find the clusters that
> document is in.  (some clustering engines let you do this).
>
> 4: And then imagine that, when somebody submits a search, you quickly turn
> it into a document, add it to the index, redo the clusters, find the
> clusters this new temp doc is in, and use that as the results.
>

I guess I'd argue that this is already what Lucene does, except for the
part about adding the query into the document set.  The Lucene Query is just
your arbitrary document.  Really, the primary difference as I see it, I
think, is that you want the Carrot2 scoring mechanism instead of the
existing Lucene one, no?  Otherwise, I don't see much benefit to actually
indexing the query, other than it could potentially be used to skew results
over time as people ask the same queries over and over again.

Under a certain lens, couldn't you just argue that search is finding  
all the docs that cluster around your query?  (I know that isn't the  
traditional description, but regardless, the math underneath is often  
very similar)


> Benefits?
>
> I'm not saying this would be practical, but would it be useful?  Or, in
> particular, would it be more useful than the normal Solr/Lucene relevancy?
> As I recall Carrot^2 had 3 choices for clustering.
>
> And let's assume that the searches coming in are more than the 1.4 words
> average.  Maybe a few sentences or something.  I'm not sure a 1 word query
> would really benefit from this.  :-)
>
> Some clustering algorithms don't allow you to find a cluster containing a
> specific document, so those wouldn't work as a "search engine".
>
> More Like This as a "cluster" search?
>
> A similar scenario could be made for the "more like this" feature.  Take a
> user's search text (presumably lengthy), quickly index it, then use that
> new temp doc as a MLT seed doc.  I haven't looked deep into the code, it
> might be that it uses essentially the same relevancy as a query.

Again, I don't see the benefit of indexing it.  You slightly perturb the
corpus statistics, but other than that, how is it different from just
submitting the query and getting back the results?
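
For a rough sense of how slight that perturbation is, the sketch below
compares a term's IDF before and after one extra document containing it (the
indexed query) is added. It assumes the idf shape of Lucene's default
similarity, 1 + ln(numDocs / (docFreq + 1)); the corpus sizes and document
frequencies are made up.

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def idf_shift(doc_freq, num_docs):
    """IDF of a term before and after indexing one more doc that contains it."""
    before = idf(doc_freq, num_docs)
    after = idf(doc_freq + 1, num_docs + 1)
    return before, after


if __name__ == "__main__":
    # (docFreq, numDocs) pairs are illustrative, not measured.
    for doc_freq, num_docs in [(4, 1000), (5000, 1000000)]:
        before, after = idf_shift(doc_freq, num_docs)
        print(doc_freq, num_docs, round(before, 4), round(after, 4))
```

The shift only becomes noticeable for terms that are rare in a small index;
in a large index one extra document barely moves the numbers.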