Posted to solr-user@lucene.apache.org by karthik c <ka...@gmail.com> on 2009/03/16 07:41:02 UTC

search on field to get distinct values with counts....

Hi,

We have a requirement to fetch a set of distinct values of a given field
that match the given query. We also need to fetch the number of items
associated with each field value. I figured out a way to do this for
single-valued fields but am not able to get it to work for multi-valued
fields.

Long Story:
Say you have an index of movies. I would like to get a unique set of
directors matching a query (say "john") and also the number of movies
directed by each of them. For this example, let's assume that "director" is a
single-valued field.

I came up with one approach to implement this: Search for the query string
in the director field and then apply faceting on the same field (director).
The search will limit the movie results to the ones directed by directors
matching the query. Further, the faceting will provide a unique set of
directors and also the count of movies associated with them. The query will
look something like this:
solr/Movie/select/?q=director:(john)&start=0&rows=0&facet=true&facet.field=raw_director
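
A minimal sketch of how such a request URL could be assembled in Python. The helper name build_facet_query and the relative base path are assumptions for illustration; only the parameters (q, start, rows, facet, facet.field) come from the query above. Note that urlencode percent-encodes the colon and parentheses, which is equivalent to the unencoded form shown above.

```python
from urllib.parse import urlencode


def build_facet_query(search_field, term, facet_field, base="solr/Movie/select/"):
    """Build a facet-only query URL (hypothetical helper, not part of Solr)."""
    params = [
        ("q", f"{search_field}:({term})"),  # restrict movies by the searched field
        ("start", 0),
        ("rows", 0),                        # return no documents, only facet counts
        ("facet", "true"),
        ("facet.field", facet_field),       # facet on the untokenized copy of the field
    ]
    return base + "?" + urlencode(params)


url = build_facet_query("director", "john", "raw_director")
```

The same helper covers the multi-valued case by passing "actors" and "raw_actors" instead.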

This query works fine for single-valued fields. However, it does not work in
the case of multi-valued fields. Say we perform a similar search on the
"actors" (multi-valued) field; the query will look like:
solr/Movie/select/?q=actors:(john)&start=0&rows=0&facet=true&facet.field=raw_actors
In this case, the search will again limit the movie results to the ones in
which actors matching the query have acted. However, while faceting the
results on "actors", the facet results will also contain other actors that
have acted in the resulting movies. For example, say we are searching for
actors:malkovich. This will return all movies in which John Malkovich has
acted. When the faceting is applied on these results, the facet results
contain John Malkovich with the correct number of movies. But the facet
results also contain other actors who have acted with John Malkovich. The
facet results for the above query look something like this:
<lst name="facet_fields">
    <lst name="raw_actors">
        <int name="John Malkovich">49</int>
        <int name="Catherine Deneuve">4</int>
        <int name="John Cusack">3</int>
        <int name="Angelina Jolie">2</int>
        <int name="Evangeline Lilly">2</int>
        <int name="Glenne Headly">2</int>
        <int name="Jeremy Irons">2</int>
        <int name="Ray Winstone">2</int>
    </lst>
</lst>
The other actors in the above results are obviously not what we expect to
see, since they do not match the original query (i.e. malkovich).

Is there any other way I can approach this for multi-valued fields?

Thanks,
karthik c
http://cantspellathing.blogspot.com

Re: search on field to get distinct values with counts....

Posted by karthik c <ka...@gmail.com>.
Thanks Erik... Can we enable highlighting for facet results as well? I am
using Solr's faceting feature to get a unique set of results for the field
with counts, so unless highlighting works for facet results, it will not
really be useful.

karthik c
http://cantspellathing.blogspot.com


On Mon, Mar 16, 2009 at 9:18 PM, Erik Hatcher <er...@ehatchersolutions.com> wrote:

> Perhaps the Highlighting feature will help?  You could use that to see
> which docs have something highlighted for the field in question.
>
>        Erik

Re: search on field to get distinct values with counts....

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Perhaps the Highlighting feature will help?  You could use that to see  
which docs have something highlighted for the field in question.

	Erik
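
A rough sketch of the highlighting idea, under several assumptions: the request returns documents (rows greater than 0) with hl=true and hl.fl=actors, the response carries a "highlighting" section keyed by document id, each snippet covers one whole field value, and matches are wrapped in <em> tags. The function name and the sample data are hypothetical. It tallies, per highlighted value, the number of documents in which that value matched.

```python
import re
from collections import Counter

EM_TAG = re.compile(r"</?em>")


def matched_value_counts(highlighting, field):
    """Count documents per highlighted value of `field` (hypothetical helper)."""
    counts = Counter()
    for fields in highlighting.values():
        # Use a set so a value is counted at most once per document.
        values = {EM_TAG.sub("", snippet) for snippet in fields.get(field, [])}
        counts.update(values)
    return counts


# Hypothetical highlighting section for three returned documents.
sample = {
    "movie-1": {"actors": ["John <em>Malkovich</em>"]},
    "movie-2": {"actors": ["John <em>Malkovich</em>"]},
    "movie-3": {},
}
counts = matched_value_counts(sample, "actors")
```

Unlike faceting, this only sees the documents actually returned, so the counts are complete only if every matching document is fetched.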

On Mar 16, 2009, at 10:12 AM, karthik c wrote:

> Thanks Otis... What kind of post-processing are you talking about here?
> Is there any mechanism in Solr to identify which of the facet results
> match the query?
>
> karthik c
> http://cantspellathing.blogspot.com


Re: search on field to get distinct values with counts....

Posted by karthik c <ka...@gmail.com>.
Thanks Otis... What kind of post-processing are you talking about here? Is
there any mechanism in Solr to identify which of the facet results match the
query?

karthik c
http://cantspellathing.blogspot.com


On Mon, Mar 16, 2009 at 6:57 PM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

>
> If your searches are simplistic and Shalin's suggestion is not an option
> for you for some reason, perhaps even something as simple as
> post-processing/filtering of returned facets will work.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: search on field to get distinct values with counts....

Posted by Otis Gospodnetic <ot...@yahoo.com>.
If your searches are simplistic and Shalin's suggestion is not an option for you for some reason, perhaps even something as simple as post-processing/filtering of returned facets will work.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
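
One way the post-processing could look, as a minimal sketch: parse the facet section of the XML response and keep only the values that contain the query term. The function name matching_facets is hypothetical, and the case-insensitive substring test is an assumption that only approximates how Solr analyzes the query; the sample XML is the facet output from the original question.

```python
import xml.etree.ElementTree as ET

FACET_XML = """<lst name="facet_fields">
    <lst name="raw_actors">
        <int name="John Malkovich">49</int>
        <int name="Catherine Deneuve">4</int>
        <int name="John Cusack">3</int>
        <int name="Angelina Jolie">2</int>
        <int name="Evangeline Lilly">2</int>
        <int name="Glenne Headly">2</int>
        <int name="Jeremy Irons">2</int>
        <int name="Ray Winstone">2</int>
    </lst>
</lst>"""


def matching_facets(xml_text, facet_field, term):
    """Return (value, count) pairs whose value contains `term`, ignoring case."""
    root = ET.fromstring(xml_text)
    needle = term.lower()
    results = []
    for field in root.iter("lst"):
        if field.get("name") != facet_field:
            continue  # skip the wrapper element and any other facet fields
        for entry in field.findall("int"):
            name = entry.get("name")
            if needle in name.lower():
                results.append((name, int(entry.text)))
    return results


malkovich_only = matching_facets(FACET_XML, "raw_actors", "malkovich")
```

For the term "john" the same call keeps both John Malkovich and John Cusack, which is the behavior asked for in the original question.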





Re: search on field to get distinct values with counts....

Posted by karthik c <ka...@gmail.com>.
Thanks for reading through the long question and providing suggestions,
Shalin :)
You are right about the results being correct. The problem is surely caused
by the approach used.

I guess having different types of documents (for movies, for actors, etc.)
will help. However, with this approach, I will have to pre-compute and index
the number of movies associated with each actor as well. I will need to do
this for the other fields too. Do let me know if you have any other
suggestions/approaches.

Thanks,
karthik c
http://cantspellathing.blogspot.com


On Mon, Mar 16, 2009 at 12:37 PM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:

> Note that a document in your index represents a movie. You are actually
> searching for movies and not actors. Looking from that perspective, the
> results are correct.
>
> You may need to re-think your schema. Make a document represent what you
> want to search. Perhaps have different types of documents for 'actors',
> 'movies' etc.
>
> --
> Regards,
> Shalin Shekhar Mangar.
>

Re: search on field to get distinct values with counts....

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

Note that a document in your index represents a movie. You are actually
searching for movies and not actors. Looking from that perspective, the
results are correct.

You may need to re-think your schema. Make a document represent what you
want to search. Perhaps have different types of documents for 'actors',
'movies' etc.

-- 
Regards,
Shalin Shekhar Mangar.
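
A small sketch of what the schema change might involve at indexing time, assuming movie records are available as dictionaries: derive one document per actor, with the movie count pre-computed. The field names ("type", "name", "movie_count"), the function name, and the sample records are all hypothetical.

```python
from collections import defaultdict


def build_actor_documents(movies):
    """Derive one document per actor, with a pre-computed movie count."""
    titles_by_actor = defaultdict(list)
    for movie in movies:
        # set() guards against an actor being listed twice on one movie.
        for actor in set(movie["actors"]):
            titles_by_actor[actor].append(movie["title"])
    return [
        {"type": "actor", "name": actor, "movie_count": len(titles)}
        for actor, titles in sorted(titles_by_actor.items())
    ]


# Hypothetical movie records, for illustration only.
movies = [
    {"title": "Movie A", "actors": ["John Malkovich", "John Cusack"]},
    {"title": "Movie B", "actors": ["John Malkovich"]},
]
actor_docs = build_actor_documents(movies)
```

Searching the "name" field of such actor documents would then return only the matching actors, each with its count, at the cost of re-deriving the counts whenever movies change.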