Posted to java-user@lucene.apache.org by "Russell M. Allen" <Ru...@aebn.net> on 2006/07/27 18:02:46 UTC

Scoring a document (count?)

I am curious about the potential use of document scoring as a means to
extract additional data from an index.  Specifically, I would like the
score to be a count of how many times a particular field matched a set
of terms.
 
For example, I am indexing movie stars (each document is a movie star).
A movie star has a number of fields, such as name, movies they have been
in, etc.  I want to produce an 'index' of stars by name and show, for
each star, how many of their movies match a filter.

In natural language my query might be:
	"List all stars who have appeared in a 'horror' movie, where last
	name starts with A, and tell me how many horror movies they were in."

My search will look something like this, where movie is a previously
computed list of 'horror' movie ids:
	"+lastName:A* +movie:(1 7 21 58 92)"

If my index contained the following documents:
    doc1 = lastName:Anna   movie:{3 10}
    doc2 = lastName:Aba    movie:{1 10 12}
    doc3 = lastName:Addd   movie:{3 21 55 92}
    doc4 = lastName:Baaa   movie:{7 56}

I would like to get back:
    doc2, score of 1	//score of 1 because only movie 1 matched
    doc3, score of 2	//score of 2 because movies 21 and 92 matched
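
To make the desired result concrete, here is a minimal sketch in plain Java
(no Lucene involved) that computes the same counts over the four example
documents.  The class and method names are hypothetical and exist only for
illustration; the map is an in-memory stand-in for the star index.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class MatchCount {
    /** In-memory stand-in for the star index: lastName -> movie ids. */
    static Map<String, Set<Integer>> exampleIndex() {
        Map<String, Set<Integer>> stars = new LinkedHashMap<>();
        stars.put("Anna", new HashSet<>(Arrays.asList(3, 10)));
        stars.put("Aba", new HashSet<>(Arrays.asList(1, 10, 12)));
        stars.put("Addd", new HashSet<>(Arrays.asList(3, 21, 55, 92)));
        stars.put("Baaa", new HashSet<>(Arrays.asList(7, 56)));
        return stars;
    }

    /** Counts, per star whose name starts with prefix, how many wanted ids match. */
    static Map<String, Integer> countMatches(Map<String, Set<Integer>> stars,
                                             String prefix, Set<Integer> wanted) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Set<Integer>> e : stars.entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                continue; // fails the +lastName:A* clause
            }
            int n = 0;
            for (int id : e.getValue()) {
                if (wanted.contains(id)) {
                    n++;
                }
            }
            if (n > 0) { // the +movie:(...) clause requires at least one match
                counts.put(e.getKey(), n);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<Integer> horror = new HashSet<>(Arrays.asList(1, 7, 21, 58, 92));
        System.out.println(countMatches(exampleIndex(), "A", horror)); // {Aba=1, Addd=2}
    }
}
```

Aba corresponds to doc2 and Addd to doc3 above.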



Currently, we perform an initial query against our Star index to
retrieve a list of stars.  Then we perform N queries against a separate
movie index to count the number of movies that match our sub filter
'horror'.  This is obviously very inefficient, and as I've shown above,
the information (count) is available during the primary query.

Thoughts?




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Scoring a document (count?)

Posted by "Russell M. Allen" <Ru...@aebn.net>.

Thank you for the reply.

I am certainly open to different ways of organizing / indexing our
documents.  However, the example I provided was simplified for the sake
of the discussion.  In truth, what I was calling a category may be an
arbitrary set of movie ids (determined by a previous query).  This
precludes 'burning in' the associations as independently indexed fields.

I'd like to take a whack at the scorer approach.  I've read through most
of the Lucene web site and have reviewed the source code quite a bit
lately.  However, I admit I am still a little lost as to how Lucene works
under the covers.  Are there any design documents available to give me a
head start?  Is the Lucene in Action book the only source of
information?  Does it discuss how Lucene works under the covers?

Thanks!
-Russell

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: Monday, July 31, 2006 4:02 AM
To: Lucene Users
Subject: Re: Scoring a document (count?)


it would certainly be possible to get a score that was a simple count of
the number of matching clauses of a boolean query -- probably just with
a modified Similarity (no coord, 1/0 tf, no idf, no norms) but you
*might* need a slightly modified TermScorer to do that.

In general though, i think you are solving your problem the wrong way
... don't just put the movie Ids in the movie-star docs ... also have
one indexed/stored field per category of movie (ie: "horror" would be
an indexed field) that would only be set on actors which have appeared
in a movie of that type -- the value of the field would be the number
of movies they have appeared in of that type.

now you do your main query, with a simple filter on the "horror" field
to ensure it has a value and you've got the stored value of the "horror"
field to tell you how many movies they've been in.




-Hoss



Re: Scoring a document (count?)

Posted by Chris Hostetter <ho...@fucit.org>.

it would certainly be possible to get a score that was a simple count of
the number of matching clauses of a boolean query -- probably just with a
modified Similarity (no coord, 1/0 tf, no idf, no norms) but you *might*
need a slightly modified TermScorer to do that.

In general though, i think you are solving your problem the wrong way ...
don't just put the movie Ids in the movie-star docs ... also have one
indexed/stored field per category of movie (ie: "horror" would be an
indexed field) that would only be set on actors which have appeared in a
movie of that type -- the value of the field would be the number of movies
they have appeared in of that type.

now you do your main query, with a simple filter on the "horror" field to
ensure it has a value and you've got the stored value of the "horror"
field to tell you how many movies they've been in.
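
The per-category field idea can be sketched as follows.  This is plain
Java rather than Lucene indexing code, and the class name, method name and
category data are all hypothetical; it only shows how the count "fields"
for one star could be derived at index time, and how a missing field means
the star has no movie of that type.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CategoryCountFields {
    /**
     * Builds one count per category for a single star.
     * movieCategories maps a movie id to its categories (made-up data).
     */
    static Map<String, Integer> buildCountFields(int[] movieIds,
                                                 Map<Integer, List<String>> movieCategories) {
        Map<String, Integer> fields = new TreeMap<>();
        for (int id : movieIds) {
            List<String> cats =
                movieCategories.getOrDefault(id, Collections.<String>emptyList());
            for (String cat : cats) {
                fields.merge(cat, 1, Integer::sum); // field value = number of movies
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> categories = new HashMap<>();
        categories.put(1, Arrays.asList("horror"));
        categories.put(21, Arrays.asList("horror"));
        categories.put(92, Arrays.asList("horror"));
        categories.put(3, Arrays.asList("comedy"));

        // Star "Addd" from the example: movies 3, 21, 55, 92.
        Map<String, Integer> addd = buildCountFields(new int[] {3, 21, 55, 92}, categories);
        System.out.println(addd);                        // {comedy=1, horror=2}
        System.out.println(addd.containsKey("western")); // false: field not set
    }
}
```

Note that this only works when the categories are known at index time.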




-Hoss



Re: Scoring a document (count?)

Posted by Doron Cohen <DO...@il.ibm.com>.

Doron Cohen/Haifa/IBM@IBMIL wrote on 28/07/2006 00:18:47:
> For the scoring approach - I don't see an easy way to get the
> counts from the score of the results, although the TF (term
> frequency in candidate docs) is known+used during document
> scoring, and although it seems that the application can be
> arranged such that TF of search result documents would be the
> required count.

Thinking more about this, it is possible, though neither very simple nor
very clean. You would need to write your own variation of the TermQuery
class, something like TfTermQuery, with its own variations of the Weight
and Scorer classes. This scorer can assign the raw term frequencies as the
score (disabling the part of scoring that takes IDF and normalization into
account...). You can then query with your own HitCollector to collect the
raw scores. I think this would compute what you were asking for...
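
The following is only a toy model of that idea in plain Java, not Lucene
code and not the TfTermQuery implementation itself.  The postings map, the
Collector interface and all method names are hypothetical; the point is
just that a scorer which reports raw tf, summed per document by a
collector, yields the count.  It covers only the movie clause, so the
lastName restriction would still have to be applied separately.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class RawTfSketch {
    /** Hypothetical callback, playing the role a hit collector would play. */
    interface Collector {
        void collect(int doc, float score);
    }

    /** Toy postings for one field: term -> (doc id -> term frequency). */
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    void index(int doc, String... terms) {
        for (String t : terms) {
            postings.computeIfAbsent(t, k -> new TreeMap<>()).merge(doc, 1, Integer::sum);
        }
    }

    /** "Scores" one term: each matching doc gets its raw tf, with no IDF or norms. */
    void scoreTerm(String term, Collector c) {
        Map<Integer, Integer> docs =
            postings.getOrDefault(term, Collections.<Integer, Integer>emptyMap());
        for (Map.Entry<Integer, Integer> p : docs.entrySet()) {
            c.collect(p.getKey(), p.getValue().floatValue());
        }
    }

    /** Runs every term and sums the raw scores per document. */
    Map<Integer, Float> search(String... terms) {
        Map<Integer, Float> totals = new TreeMap<>();
        for (String t : terms) {
            scoreTerm(t, (doc, score) -> totals.merge(doc, score, Float::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        RawTfSketch movieField = new RawTfSketch();
        movieField.index(1, "3", "10");
        movieField.index(2, "1", "10", "12");
        movieField.index(3, "3", "21", "55", "92");
        movieField.index(4, "7", "56");
        // Movie clause only; doc 4 matches here because no lastName filter is applied.
        System.out.println(movieField.search("1", "7", "21", "58", "92")); // {2=1.0, 3=2.0, 4=1.0}
    }
}
```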

- Doron



RE: Scoring a document (count?)

Posted by Doron Cohen <DO...@il.ibm.com>.

Hi Russell, my apologies for the delayed response. I would rather keep all
correspondence on the mailing list, but to keep this mail thread readable I
put the files at http://cdoronc.awardspace.com/TfTermQuery . I hope they
help you, and I would be interested in your comments.

Regards,
Doron

"Russell M. Allen" <Ru...@aebn.net> wrote on 04/08/2006 06:45:37:
> Doron, thanks for the code offer.  That would be great.  I was able to
> get a partial implementation working myself, but I ran into some issues
> (most of which are rooted in a lack of understanding of Lucene internals
> on my part).  I am sure I can learn a few things from your solution to
> this problem.
>
> You may email me directly with the code, or if it's small enough, post
> it to the list for posterity.
>
> Thanks again!
> -Russell




RE: Scoring a document (count?)

Posted by "Russell M. Allen" <Ru...@aebn.net>.

Doron, thanks for the code offer.  That would be great.  I was able to
get a partial implementation working myself, but I ran into some issues
(most of which are rooted in a lack of understanding of Lucene internals
on my part).  I am sure I can learn a few things from your solution to
this problem.

You may email me directly with the code, or if it's small enough, post
it to the list for posterity.

Thanks again!
-Russell

-----Original Message-----
From: Doron Cohen [mailto:DORONC@il.ibm.com] 
Sent: Thursday, August 03, 2006 6:04 PM
To: java-user@lucene.apache.org
Subject: RE: Scoring a document (count?)

Hi Russell,

I am also interested in the internals of Lucene's ranking and how one
can/should alter the scoring. For now I was just learning from the
existing code of Lucene scorers and Weights. Your question seemed
interesting, so I in fact implemented a quick scorer that would return
the raw tf as a score, as an exercise. It is not a product level
implementation of course, but if you think this will help you (?) I can
share the code. (I would have responded sooner, but my work computer
went off for a few days with a fan error...:-).

Regards,
Doron

RE: Scoring a document (count?)

Posted by Doron Cohen <DO...@il.ibm.com>.

Hi Russell,

I am also interested in the internals of Lucene's ranking and how one
can/should alter the scoring. For now I was just learning from the existing
code of Lucene scorers and Weights. Your question seemed interesting, so I
in fact implemented a quick scorer that would return the raw tf as a score,
as an exercise. It is not a product level implementation of course, but if
you think this will help you (?) I can share the code. (I would have
responded sooner, but my work computer went off for a few days with a
fan error...:-).

Regards,
Doron

"Russell M. Allen" <Ru...@aebn.net> wrote on 31/07/2006 07:35:50:

> Thank you for the reply Doron!  You are exactly right about the sql
> count(*).  I need the equivalent of group by, and count().
>
> We have considered a 'joined' index where we would have a document for
> each permutation.  We discarded it (possibly prematurely) based on the
> rapid explosion in the number of documents.  In our domain, we have
> movies as the main document type, and 5 other satellite document types
> with their own indexes: Star, Studio, Director, Series, and Category
> (genre).  With the exception of series, a movie has a many to many
> relationship with the other indexes.  So, with 60k movies, 20k stars, 2k
> studios, ... The document count quickly shoots through the roof.
>
> Also, the majority of our searching is based on a single domain type,
> such as movie.  It is only a small handful of corner cases where we want
> what amounts to a joined query.  If we merged these indexes, we would
> constantly have to 'roll up' the results into distinct instances of a
> type.  (The equivalent of an SQL 'group by')
>
>
> I find the parallels between the expressiveness of Lucene and SQL
> interesting.  I'm glad to see you compared what I was looking for to an
> sql count(*) as well.  We have a handful of indexing issues that I am
> attempting to solve/optimize, of which performing a count(*) is only
> one.  I also have the need to perform a JOIN across two indexes.  I have
> 'ideas' about how I might go about this, but for now we are fortunate
> enough to have fairly static data and half of the join is static.  As a
> result we can cache a bitset filter for the results of half the join and
> apply it to the other (dynamic) half of the join query.
>
> Anyway, I digress...
>
> I saw your second post regarding creating a scorer.  I'd like to continue
> down that path.  My main issue now is simply understanding how Lucene
> works under the covers well enough to write the TermQuery variant.
>
> Thanks for the help,
> Russell.
>



RE: Scoring a document (count?)

Posted by "Russell M. Allen" <Ru...@aebn.net>.

Thank you for the reply Doron!  You are exactly right about the sql
count(*).  I need the equivalent of group by, and count().

We have considered a 'joined' index where we would have a document for
each permutation.  We discarded it (possibly prematurely) based on the
rapid explosion in the number of documents.  In our domain, we have
movies as the main document type, and 5 other satellite document types
with their own indexes: Star, Studio, Director, Series, and Category
(genre).  With the exception of series, a movie has a many to many
relationship with the other indexes.  So, with 60k movies, 20k stars, 2k
studios, ... The document count quickly shoots through the roof.

Also, the majority of our searching is based on a single domain type,
such as movie.  It is only a small handful of corner cases where we want
what amounts to a joined query.  If we merged these indexes, we would
constantly have to 'roll up' the results into distinct instances of a
type.  (The equivalent of an SQL 'group by')


I find the parallels between the expressiveness of Lucene and SQL
interesting.  I'm glad to see you compared what I was looking for to an
sql count(*) as well.  We have a handful of indexing issues that I am
attempting to solve/optimize, of which performing a count(*) is only
one.  I also have the need to perform a JOIN across two indexes.  I have
'ideas' about how I might go about this, but for now we are fortunate
enough to have fairly static data and half of the join is static.  As a
result we can cache a bitset filter for the results of half the join and
apply it to the other (dynamic) half of the join query.
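
As a rough illustration of the cached-bitset idea, here is a sketch using
only java.util.BitSet.  The class and method names are hypothetical and the
document ids are made up; it simply keeps one bit per document for the
static half of the join and intersects it with the hits from the dynamic
half.

```java
import java.util.BitSet;

public class CachedJoinFilter {
    /** Cached result of the static half of the join: one bit per document id. */
    private final BitSet staticHalf;

    CachedJoinFilter(int maxDoc, int[] matchingDocs) {
        staticHalf = new BitSet(maxDoc);
        for (int d : matchingDocs) {
            staticHalf.set(d);
        }
    }

    /** Keeps only those dynamic hits that also pass the cached static filter. */
    BitSet apply(BitSet dynamicHits) {
        BitSet result = (BitSet) dynamicHits.clone(); // leave the caller's bits untouched
        result.and(staticHalf);
        return result;
    }

    public static void main(String[] args) {
        CachedJoinFilter filter = new CachedJoinFilter(10, new int[] {1, 3, 5, 7});

        BitSet dynamicHits = new BitSet(10);
        dynamicHits.set(2);
        dynamicHits.set(3);
        dynamicHits.set(7);
        dynamicHits.set(8);

        System.out.println(filter.apply(dynamicHits)); // {3, 7}
    }
}
```

The filter is built once and reused across queries, which is only safe
while the static half of the data does not change.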

Anyway, I digress...

I saw your second post regarding creating a scorer.  I'd like to continue
down that path.  My main issue now is simply understanding how Lucene
works under the covers well enough to write the TermQuery variant.

Thanks for the help,
Russell.


-----Original Message-----
From: Doron Cohen [mailto:DORONC@il.ibm.com] 
Sent: Friday, July 28, 2006 3:19 AM
To: java-user@lucene.apache.org
Subject: Re: Scoring a document (count?)

This task reminds me more of a count(*) SQL query than a text search
query.

Assuming that using a text search engine is a prerequisite, I can think
of two approaches - one based on Lucene scoring, as suggested in the
question, or a simpler approach (below).

For the scoring approach - I don't see an easy way to get the counts
from the score of the results, although the TF (term frequency in
candidate docs) is known+used during document scoring, and although it
seems that the application can be arranged such that TF of search result
documents would be the required count.

But perhaps a more straightforward solution would do - adding a Lucene
document for each star-movie pair. This would also allow easy updates
when a new movie arrives: just add a document for each "star" in that
movie. A document can have these fields:
   StarFirstName - stored, untokenized
   StarLastName - stored, untokenized
   MovieName - stored, tokenized
   MovieType - stored, untokenized - this is the pre-computed type
      mentioned in the question
   MovieProps - unstored, tokenized - the word "horror" can appear in
      this field, avoiding a pre-computation step.
Now a single search can do all the work:
   +StarLastName:A* +MovieProps:horror
Sorting results by StarLastName would group all results of the same
"star" and also allow counting them for each star.

This would create more documents in the index -  #stars * |#movies per
star| - so there may be performance considerations, depending on the
volume of the data...

Regards,
Doron

Re: Scoring a document (count?)

Posted by Doron Cohen <DO...@il.ibm.com>.

This task reminds me more of a count(*) SQL query than a text search query.

Assuming that using a text search engine is a prerequisite, I can think of
two approaches - one based on Lucene scoring, as suggested in the question,
or a simpler approach (below).

For the scoring approach - I don't see an easy way to get the counts from
the score of the results, although the TF (term frequency in candidate
docs) is known+used during document scoring, and although it seems that the
application can be arranged such that TF of search result documents would
be the required count.

But perhaps a more straightforward solution would do - adding a Lucene
document for each star-movie pair. This would also allow easy updates when
a new movie arrives: just add a document for each "star" in that movie. A
document can have these fields:
   StarFirstName - stored, untokenized
   StarLastName - stored, untokenized
   MovieName - stored, tokenized
   MovieType - stored, untokenized - this is the pre-computed type
      mentioned in the question
   MovieProps - unstored, tokenized - the word "horror" can appear in this
      field, avoiding a pre-computation step.
Now a single search can do all the work:
   +StarLastName:A* +MovieProps:horror
Sorting results by StarLastName would group all results of the same "star"
and also allow counting them for each star.
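
A small plain-Java sketch of the star-movie pair layout and the
group-and-count step follows.  It does not use Lucene; the PairDoc class,
the countByStar method and the movie names are hypothetical, and a sorted
map stands in for sorting the results by StarLastName.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PairDocs {
    /** Hypothetical flattened document: one per star-movie pair. */
    static final class PairDoc {
        final String starLastName;
        final String movieName;
        final String movieProps; // whitespace-separated words, e.g. "horror thriller"

        PairDoc(String starLastName, String movieName, String movieProps) {
            this.starLastName = starLastName;
            this.movieName = movieName;
            this.movieProps = movieProps;
        }
    }

    /** Mimics +StarLastName:prefix* +MovieProps:prop, then groups and counts by star. */
    static Map<String, Integer> countByStar(List<PairDoc> docs, String prefix, String prop) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys group each star together
        for (PairDoc d : docs) {
            boolean hasProp = Arrays.asList(d.movieProps.split("\\s+")).contains(prop);
            if (d.starLastName.startsWith(prefix) && hasProp) {
                counts.merge(d.starLastName, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<PairDoc> docs = new ArrayList<>();
        docs.add(new PairDoc("Aba", "Movie 1", "horror"));
        docs.add(new PairDoc("Aba", "Movie 10", "comedy"));
        docs.add(new PairDoc("Addd", "Movie 21", "horror thriller"));
        docs.add(new PairDoc("Addd", "Movie 92", "horror"));
        docs.add(new PairDoc("Baaa", "Movie 7", "horror"));
        System.out.println(countByStar(docs, "A", "horror")); // {Aba=1, Addd=2}
    }
}
```

Grouping on last name alone would merge two different stars who share a
last name, so a real index would likely need a unique star id as well.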

This would create more documents in the index -  #stars * |#movies per
star| - so there may be performance considerations, depending on the volume
of the data...

Regards,
Doron