Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2010/03/16 06:17:36 UTC

MoreLikeThisQuery

Greets,

Lucene has a MoreLikeThisQuery in contrib:

  http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/similar/MoreLikeThis.html

It functions by selecting a handful of high-value (i.e. rare) terms out of a
document and building up a composite ORQuery based on those. 

The thing that's always bothered me about its results is that it gets thrown
off by things like proper names.  

Proper names are often very rare, and thus highly discriminatory terms.  They
often pass all the heuristics that MoreLikeThisQuery uses: low doc_freq()
(meaning the term occurs in few documents), long token length (more than 5
characters), etc.
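
Just to make that selection step concrete, here's a rough, untested sketch of
such a heuristic filter in Perl.  The sub name and the cutoffs are made up,
and it is not the actual Lucene code:

  use List::Util qw( min );

  # Keep terms that are rare (low doc_freq) and reasonably long, score
  # them by tf * idf, and return the top $max_terms.
  sub select_query_terms {
    my ( $term_freqs, $doc_freqs, $num_docs, $max_terms ) = @_;
    my @candidates;
    for my $term ( keys %$term_freqs ) {
      my $doc_freq = $doc_freqs->{$term} or next;
      next if length($term) <= 5;             # long-token heuristic
      next if $doc_freq > $num_docs / 100;    # low-doc_freq heuristic
      my $idf = 1 + log( $num_docs / ( $doc_freq + 1 ) );
      push @candidates, [ $term, $term_freqs->{$term} * $idf ];
    }
    @candidates = sort { $b->[1] <=> $a->[1] } @candidates;
    my $count = min( $max_terms, scalar @candidates );
    return map { $_->[0] } @candidates[ 0 .. $count - 1 ];
  }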

The problem is that if you have e.g. two authors with the same (uncommon) last
name, but these authors write on entirely different subjects,
MoreLikeThisQuery will often conflate them.

However, there is a potential remedy available if we use clustering.  Say that
the heuristics yield this collection of terms:

    economics capital interest investment addison 
  
One of these things is not like the others.  :)  Meaning, if you look at all
those terms in a vector space, most of them will be clustered together, but
one will be way far away.

What I'd like to do is identify the cluster that best represents the document,
and exclude any terms outside of that cluster when building the
MoreLikeThisQuery.   

What kind of a data structure would we need to achieve that?
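
As a strawman, here's one naive shape an answer could take: represent each
term by the set of documents it appears in, and drop any term whose vector
strays too far from the centroid.  A throwaway, untested sketch (the sub
names and the threshold are made up):

  # A term's vector is the set of doc ids it occurs in.  Terms from the
  # same topic tend to co-occur in similar documents, so a term whose
  # vector is far from the centroid is a likely outlier.
  sub filter_outlier_terms {
    my ( $postings, $threshold ) = @_;    # term => arrayref of doc ids
    my %vecs = map {
      my $term = $_;
      $term => { map { $_ => 1 } @{ $postings->{$term} } };
    } keys %$postings;
    my $num_terms = scalar keys %vecs;
    my %centroid;
    for my $vec ( values %vecs ) {
      $centroid{$_} += 1 / $num_terms for keys %$vec;
    }
    return grep { cosine( $vecs{$_}, \%centroid ) >= $threshold }
      keys %vecs;
  }

  sub cosine {
    my ( $x, $y ) = @_;
    my ( $dot, $nx, $ny ) = ( 0, 0, 0 );
    $dot += $x->{$_} * ( $y->{$_} || 0 ) for keys %$x;
    $nx += $_**2 for values %$x;
    $ny += $_**2 for values %$y;
    return $nx && $ny ? $dot / sqrt( $nx * $ny ) : 0;
  }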

Marvin Humphrey


Re: [Lucy] Re: MoreLikeThisQuery

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 3/16/10 9:02 AM:

> His suggestion was to use OpenCyc to classify terms.
> 
> That's similar to what we'd do with topic vectors generated by an indexing
> component, except that the Cyc topic vectors were built laboriously by hand
> rather than using automatic dimension reduction.

I've been looking at the SenseClusters package. It's very unfriendly to use, but
the ideas in it are worth some investigation.

http://www.d.umn.edu/~tpederse/senseclusters.html

It uses SVDPACKC to do the big matrix math:
http://netlib.org/svdpack/

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: MoreLikeThisQuery

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Mar 16, 2010 at 10:02 AM, Marvin Humphrey
<ma...@rectangular.com> wrote:
>
> Even very slow ones, eh?  How about one that requires gobs of RAM?

Sure, I mean, I was just thinking the algorithm could be prototyped first,
then made fast?

>
> This idea actually came out of a conversation I had with someone at a San
> Diego Ruby Users meeting who used to work on the OpenCyc classification
> engine.  From what I understand, the Cyc project is an AI project that sits on
> top of a kind of Yahoo directory or DMOZ for words.  Apparently it has a Java
> API and requires several GB of RAM to load.
>
>    http://www.cyc.com/cyc/opencyc
>    http://en.wikipedia.org/wiki/Cyc
>
> His suggestion was to use OpenCyc to classify terms.
>
> That's similar to what we'd do with topic vectors generated by an indexing
> component, except that the Cyc topic vectors were built laboriously by hand
> rather than using automatic dimension reduction.

At a quick glance, this sounds possibly similar to
http://www.unc.edu/~jaguera/query-expansion/ (except that it uses WordNet
instead).


-- 
Robert Muir
rcmuir@gmail.com

Re: MoreLikeThisQuery

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Mar 16, 2010 at 09:01:05AM -0400, Robert Muir wrote:
> On Tue, Mar 16, 2010 at 1:17 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
> > What I'd like to do is identify the cluster that best represents the document,
> > and exclude any terms outside of that cluster when building the
> > MoreLikeThisQuery.
> >
> > What kind of a data structure would we need to achieve that?

> Marvin, I use this for query expansion purposes, so if you have any
> ideas (even very slow ones) you want to test, I'd be happy to help
> with some relevance-testing gruntwork.

Even very slow ones, eh?  How about one that requires gobs of RAM?

This idea actually came out of a conversation I had with someone at a San
Diego Ruby Users meeting who used to work on the OpenCyc classification
engine.  From what I understand, the Cyc project is an AI project that sits on
top of a kind of Yahoo directory or DMOZ for words.  Apparently it has a Java
API and requires several GB of RAM to load.

    http://www.cyc.com/cyc/opencyc
    http://en.wikipedia.org/wiki/Cyc

His suggestion was to use OpenCyc to classify terms.

That's similar to what we'd do with topic vectors generated by an indexing
component, except that the Cyc topic vectors were built laboriously by hand
rather than using automatic dimension reduction.

Marvin Humphrey


Re: MoreLikeThisQuery

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Mar 16, 2010 at 1:17 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
> What I'd like to do is identify the cluster that best represents the document,
> and exclude any terms outside of that cluster when building the
> MoreLikeThisQuery.
>
> What kind of a data structure would we need to achieve that?
>
> Marvin Humphrey
>
>

Marvin, I use this for query expansion purposes, so if you have any
ideas (even very slow ones) you want to test, I'd be happy to help
with some relevance-testing gruntwork.

-- 
Robert Muir
rcmuir@gmail.com

Re: [Lucy] MoreLikeThisQuery

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 3/16/10 12:17 AM:

> However, there is a potential remedy available if we use clustering.  Say that
> the heuristics yield this collection of terms:
> 
>     economics capital interest investment addison 
>   
> One of these things is not like the others.  :)  Meaning, if you look at all
> those terms in a vector space, most of them will be clustered together, but
> one will be way far away.

wouldn't POS tagging achieve the same ends? Or even a look-up lexicon of nouns?


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

C header file location

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Thu, Mar 18, 2010 at 09:54:18PM -0500, Peter Karman wrote:
> > The easy way to handle things would be the same way we're going with
> > ProximityScorer, glomming onto the KinoSearch build directly.  But we could
> > also try jamming stuff into /usr/local/include and worry about permissions
> > issues and portability later.
> 
> maybe parse 'ccflags' from $Config ? that seems to include the -I paths for the
> perl binary itself, which might be more install-specific than assuming
> /usr/local/include.

Prior to the adoption of our new backwards compatibility policy, I would have
thought it necessary to install headers into different locations for each
host.  So if you had KS/Lucy installed with Perl 5.8.9, Perl 5.10.1, and Ruby
1.9.1, that would mean 3 different include directories with separate copies of
our header files.

Now, though, I think /usr/local/include or its closest equivalent is the right
place.  The unstable branch is only meant to be used on machines where the
user has control over what gets installed; it simply won't be safe to have
multiple hosts with out-of-sync versions, and users will have to make sure they
"don't do that".  And stable forks like KinoSearch3 will have stable C APIs:
so long as we don't accidentally force-downgrade by overwriting KinoSearch3
version 3.01 with version 3.00, it will be safe to install them into a shared
location.

Hmm, as a matter of fact, I think it now makes sense to install our compiled
shared objects into /usr/local/lib, as e.g. Swish does.

Marvin Humphrey


Re: [Lucy] Re: MoreLikeThisQuery

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 3/17/10 12:37 PM:

> The easy way to handle things would be the same way we're going with
> ProximityScorer, glomming onto the KinoSearch build directly.  But we could
> also try jamming stuff into /usr/local/include and worry about permissions
> issues and portability later.

maybe parse 'ccflags' from $Config ? that seems to include the -I paths for the
perl binary itself, which might be more install-specific than assuming
/usr/local/include.
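
Something like this, I mean (untested):

  use Config;
  # Pull the -I include dirs out of the flags perl itself was built with.
  my @inc_dirs = map  { s/^-I//; $_ }
                 grep { /^-I/ }
                 split ' ', $Config{ccflags};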

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: MoreLikeThisQuery

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Mar 17, 2010 at 02:56:15PM +0100, Nick Wellnhofer wrote:
> How does Kinosearch compute the final term weights? I had a
> look at Search/Similarity.c and it seems to be Sim_tf * Sim_idf *
> Sim_length_norm.

The Lucene folks have put a good amount of effort into documenting the Lucene
scoring model, which KS currently implements.

   http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/search/Similarity.html

As for where that stuff gets applied... it's kind of all over the place.
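
Off the top of my head -- so double-check against the Similarity docs above
before relying on it -- the classic defaults boil down to something like:

  # Rough sketch of the classic Lucene/KS default weighting.
  sub tf          { sqrt( $_[0] ) }                     # freq of term in doc
  sub idf         { 1 + log( $_[0] / ( $_[1] + 1 ) ) }  # (max_doc, doc_freq)
  sub length_norm { 1 / sqrt( $_[0] ) }                 # tokens in the field

  my ( $freq, $max_doc, $doc_freq, $num_tokens ) = ( 3, 1_000, 10, 250 );
  my $weight = tf($freq) * idf( $max_doc, $doc_freq )
             * length_norm($num_tokens);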

> On a side note, how can I interface with Kinosearch or Lucy directly on
> the C level? Is there any documentation yet?

There's no C API just yet.

To get things up and running, the first thing we would need to do is put all
the C header files somewhere.  That's kind of tricky on Unixen because the
CPAN module installation process doesn't normally touch /usr/local/include,
and on Windows, it's trickier still.

The easy way to handle things would be the same way we're going with
ProximityScorer, glomming onto the KinoSearch build directly.  But we could
also try jamming stuff into /usr/local/include and worry about permissions
issues and portability later.

Marvin Humphrey


Re: MoreLikeThisQuery

Posted by Nick Wellnhofer <we...@aevum.de>.
On 16.03.2010 22:44, Marvin Humphrey wrote:
> On Tue, Mar 16, 2010 at 04:15:40PM +0100, Nick Wellnhofer wrote:
>> What's the easiest way to get to the term-document matrix either during or
>> after indexing?
> 
> I'm not sure what format would be most helpful for you.  Here's code to
> iterate over all terms and all postings in all segments for the "content"
> field:
> 
>   my $poly_reader = KinoSearch::Index::PolyReader->open( 
>     index => '/path/to/index',
>   );  
>   my %postings;
>   my $offset = 0;
>   for my $seg_reader ( @{ $poly_reader->seg_readers } ) { 
>     my $lex_reader = $seg_reader->obtain("KinoSearch::Index::LexiconReader");
>     my $plist_reader
>       = $seg_reader->obtain("KinoSearch::Index::PostingListReader");
>     my $lexicon = $lex_reader->lexicon( field => 'content');
>     my $plist = $plist_reader->posting_list( field => 'content' );
>     while ($lexicon->next) {
>       my $term = $lexicon->get_term;
>       warn $term;
>       $postings{$term} ||= []; 
>       my $doc_id_array = $postings{$term};
>       $plist->seek($term);
>       while (my $seg_doc_id = $plist->next) {
>         push @$doc_id_array, $seg_doc_id + $offset;   # index-wide doc id
>       }   
>     }   
>     $offset += $seg_reader->doc_max;
>   }
> 
> Does that at least provide a point of departure?

Thanks, that's helpful. I also figured out how to get the term
frequencies. How does Kinosearch compute the final term weights? I had a
look at Search/Similarity.c and it seems to be Sim_tf * Sim_idf *
Sim_length_norm.

On a side note, how can I interface with Kinosearch or Lucy directly on
the C level? Is there any documentation yet?

Nick

-- 
aevum gmbh
rumfordstr. 4
80469 münchen
germany

tel: +49 89 3838 0653
http://aevum.de/

Re: MoreLikeThisQuery

Posted by Marvin Humphrey <ma...@rectangular.com>.
Meh, I left debugging code in the code sample from that last email.

      my $plist = $plist_reader->posting_list( field => 'content' );
      while ($lexicon->next) {
        my $term = $lexicon->get_term;
-       warn $term;
        $postings{$term} ||= []; 
        my $doc_id_array = $postings{$term};


The point is to build up that %postings hash, not to print out terms.  Do
whatever you want with %postings afterwards.

Marvin Humphrey


Re: MoreLikeThisQuery

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Mar 16, 2010 at 04:15:40PM +0100, Nick Wellnhofer wrote:
> What's the easiest way to get to the term-document matrix either during or
> after indexing?

I'm not sure what format would be most helpful for you.  Here's code to
iterate over all terms and all postings in all segments for the "content"
field:

  my $poly_reader = KinoSearch::Index::PolyReader->open( 
    index => '/path/to/index',
  );  
  my %postings;
  my $offset = 0;
  for my $seg_reader ( @{ $poly_reader->seg_readers } ) { 
    my $lex_reader = $seg_reader->obtain("KinoSearch::Index::LexiconReader");
    my $plist_reader
      = $seg_reader->obtain("KinoSearch::Index::PostingListReader");
    my $lexicon = $lex_reader->lexicon( field => 'content');
    my $plist = $plist_reader->posting_list( field => 'content' );
    while ($lexicon->next) {
      my $term = $lexicon->get_term;
      warn $term;
      $postings{$term} ||= []; 
      my $doc_id_array = $postings{$term};
      $plist->seek($term);
      while (my $seg_doc_id = $plist->next) {
        push @$doc_id_array, $seg_doc_id + $offset;   # index-wide doc id
      }   
    }   
    $offset += $seg_reader->doc_max;
  }

Does that at least provide a point of departure?
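
And if it's the matrix itself you're after, collapsing %postings into a
term-document matrix afterwards is mechanical.  A rough, untested sketch
(the sub name is made up):

  # Turn %postings (term => arrayref of doc ids) into a dense 0/1
  # occurrence matrix.  Real data would want a sparse representation.
  sub term_doc_matrix {
    my ( $postings, $num_docs ) = @_;
    my @terms = sort keys %$postings;
    my @matrix;
    for my $row_num ( 0 .. $#terms ) {
      my @row = (0) x $num_docs;
      # Doc ids start at 1, so shift down by one for array indexing.
      $row[ $_ - 1 ] = 1 for @{ $postings->{ $terms[$row_num] } };
      $matrix[$row_num] = \@row;
    }
    return ( \@terms, \@matrix );
  }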

> I'm not sure clustering really helps here. Suppose that each half of the
> search terms is from one of two clusters both of which are relevant to
> the query. Do you really want to exclude one of the clusters?

The number one goal is to exclude high-value terms that are outliers.  So
long as that is achieved, we will remove many painfully wrong results.

Beyond that, my intuition is that whether we focus on one cluster or allow
multiple clusters is less important.  Focusing on one cluster might improve
precision at the expense of recall -- or it might just hurt recall, I don't
know.  Relevance testing of the kind that Robert is talking about could help
us determine that.

Marvin Humphrey


Re: MoreLikeThisQuery

Posted by Nick Wellnhofer <we...@aevum.de>.
On 16.03.2010 06:17, Marvin Humphrey wrote:
> Greets,
> 
> Lucene has a MoreLikeThisQuery in contrib:
> 
>   http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/similar/MoreLikeThis.html
> 
> It functions by selecting a handful of high-value (i.e. rare) terms out of a
> document and building up a composite ORQuery based on those. 

Yes, it looks like it's simply an approximation to the cosine similarity
of the term vectors.

> The problem is that if you have e.g. two authors with the same (uncommon) last
> name, but these authors write on entirely different subjects,
> MoreLikeThisQuery will often conflate them.

I think a remedy for this problem is a dimensionality reduction of the
term-document matrix, like the one LSA performs. Maybe I'll experiment with
that a little in the next few weeks. What's the easiest way to get to the
term-document matrix either during or after indexing?
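
Conceptually, even a toy power iteration gets at the first LSA dimension;
real work would use a proper SVD package for the top k. A throwaway sketch
(the sub name is made up):

  # Dominant right singular vector of a term-document matrix via power
  # iteration on A'A.  LSA proper keeps the top k dimensions.
  sub top_singular_vector {
    my ($matrix) = @_;    # arrayref of rows: terms x docs
    my $cols = scalar @{ $matrix->[0] };
    my @v = map { 1 / sqrt($cols) } 1 .. $cols;
    for ( 1 .. 50 ) {
      my @u;              # u = A v
      for my $row (@$matrix) {
        my $dot = 0;
        $dot += $row->[$_] * $v[$_] for 0 .. $cols - 1;
        push @u, $dot;
      }
      my @w = (0) x $cols;    # w = A'u
      for my $i ( 0 .. $#$matrix ) {
        $w[$_] += $matrix->[$i][$_] * $u[$i] for 0 .. $cols - 1;
      }
      my $norm = 0;
      $norm += $_**2 for @w;
      last unless $norm;      # all-zero matrix: give up
      @v = map { $_ / sqrt($norm) } @w;
    }
    return \@v;
  }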

> However, there is a potential remedy available if we use clustering.  Say that
> the heuristics yield this collection of terms:
> 
>     economics capital interest investment addison 
>   
> One of these things is not like the others.  :)  Meaning, if you look at all
> those terms in a vector space, most of them will be clustered together, but
> one will be way far away.
> 
> What I'd like to do is identify the cluster that best represents the document,
> and exclude any terms outside of that cluster when building the
> MoreLikeThisQuery.   

I'm not sure clustering really helps here. Suppose that each half of the
search terms is from one of two clusters both of which are relevant to
the query. Do you really want to exclude one of the clusters?

Nick


-- 
aevum gmbh
rumfordstr. 4
80469 münchen
germany

tel: +49 89 3838 0653
http://aevum.de/