Posted to solr-user@lucene.apache.org by Corey Tisdale <co...@shopperschoice.com> on 2006/03/08 00:50:54 UTC

metadata about result sets?

Hi all,

I just found out about solr, so naturally I have to play with it. I  
hate to ask anything as basic as this, but the lack of docs makes me  
hafta. Is there any way to get metadata about a search result off of  
this bad dog? I am trying to find a good way to search through
several million items, and I think that being able to aggregate data  
about the results would help people refine the search, so if they  
search for "legal" I could show a result set with 22000 books but  
then also ask if they meant thrillers or test prep or career  
planning, etc.

On a side note, I feel dirty now. Thanks for putting up with noobness.


Corey

Re: metadata about result sets?

Posted by Chris Hostetter <ho...@fucit.org>.
: faceted metadata. The way that we build the metadata list is
: post-indexing of product, we would actually build a bit sequence that
: corresponds to all possible key/value combos for each product and
: associate each variation with the product. Then when someone refines

Hmmm... you can do something similar in solr by leveraging the Filter
caching, but you don't need to build a DocSet (bit sequence) for every
combination of values -- just one per value.  When you want to know what
the result is from combining multiple facets, you intersect the DocSets.
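A rough sketch of that idea, using plain Python sets as a stand-in for DocSets. The field names, document ids, and the `combine` helper are all made up for illustration; this is not Solr's actual filter cache API.

```python
# Hypothetical in-memory stand-in for a filter cache: one DocSet
# (here a plain set of document ids) per facet *value*, not per combination.
docsets = {
    ("category", "thriller"): {1, 2, 3, 5, 8},
    ("category", "test prep"): {4, 6, 7},
    ("price_range", "0-25"): {1, 2, 4, 9},
    ("price_range", "26-50"): {3, 5, 6, 7, 8},
}


def combine(*facets):
    """Intersect the per-value DocSets for the selected facets.

    Expects at least one facet; each facet is a (field, value) tuple.
    """
    return set.intersection(*(docsets[f] for f in facets))


cheap_thrillers = combine(("category", "thriller"), ("price_range", "0-25"))
print(sorted(cheap_thrillers))  # [1, 2]
```

With four cached sets, any pairing of a category and a price range can be answered by one intersection, so nothing has to be precomputed per combination.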

: For the schema, I just meant the document format. the file is called
: schema.xml. I haven't tried it, but it looks like you can change that
: to affect the way solr works without actually affecting the way
: lucene handles it. Is that wrong? I guess it doesn't really matter,

well, the schema.xml lets you define the fields, and what options you want
those fields to have -- but the set of options is fixed (and tied to the
set of options available in lucene).  You could have a "suggestable" field
in every doc which contained a list of field names -- but there's really
no way to annotate fields directly.

: As for 'scanning the resultset', I can see how I was a little shy on
: the details. Sorry about that. I meant look through the results to
: see what facets apply to the resultset. So if my company sells books
: and power tools, when someone searches for 'the little engine who
: could', once we know there are no power tools in the result set, I
: don't show the refinement facets for power tool metadata (like

Ah... right.  Depending on how you go about it, scanning result sets to
find out which facets apply is probably just as expensive as just testing
the facet to see if the count is positive -- especially if you have the
RAM to allocate a big Filter cache -- then you can have a DocSet for every
Filter in memory all of the time, and testing a few thousand can be done
near instantly.
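As an illustration of "just test every facet", here is a minimal sketch that assumes each facet's DocSet is already cached as a Python set. The facet names and ids are invented, and `applicable_facets` is a hypothetical helper, not part of Solr.

```python
def applicable_facets(result_docs, facet_docsets):
    """Return {facet_name: count} for facets matching at least one result."""
    counts = {}
    for name, docset in facet_docsets.items():
        count = len(result_docs & docset)  # size of the intersection
        if count > 0:
            counts[name] = count
    return counts


# Made-up data: the search matched only book documents (ids 10-14),
# so the power tool facets (ids 20-22) drop out on their own.
facet_docsets = {
    "format:hardcover": {10, 11, 12},
    "format:paperback": {13, 14},
    "wattage:500": {20, 21},
    "battery_operated:true": {21, 22},
}
results = {10, 11, 13}
print(applicable_facets(results, facet_docsets))
```

The facets that do not apply are filtered out by their zero counts, so no separate pass over the result documents is needed to decide which groups to show.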

:   In your example file, how does the name facet know to display only
: the names that start with whatever initial was selected?   Would that

Honestly, i haven't really thought about the mechanisms for dealing with
facets in that way ... it would be trivial just to let all of the names be
tested -- but that would obviously involve a lot of unnecessary
computation, so if you could configure subgroups that were only consulted
once another facet had been used, that would obviously make more sense.
Perhaps something like...

      <group id="initial" label="Author">
        <facet id="a" label="A" query="author:a*">
          <group id="name" label="Author">
            <facet use-prefix-terms-field="author">a</facet>
          </group>
        </facet>
        <facet id="b" label="B" query="author:b*">
          <group id="name" label="Author">
            <facet use-prefix-terms-field="author">b</facet>
          </group>
        </facet>
        ...

..although obviously common group/subgroup relationships like prefix
expansion could be done as a special case to make it shorter to express...

      <!-- 'mfg' facet starts with initial,
            then full names that match initial -->
      <group id="mfg" label="Manufacturer">
        <prefix-facet id="initial" field="mfg" prefix-chars="1">
          <term-value-facet id="mfg" />
        </prefix-facet>
      </group>
      <!-- 'name' facet starts with initial, then 2 char prefix,
            then full names that match prefix -->
      <group id="name" label="Author">
        <prefix-facet id="initial" field="name" prefix-chars="1">
          <prefix-facet id="i_prefix" field="name" prefix-chars="2">
            <term-value-facet id="name" />
          </prefix-facet>
        </prefix-facet>
      </group>
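To make the prefix expansion concrete, here is a small sketch of how the levels could be computed from the field values of the matching documents. The `prefix-facet` and `term-value-facet` elements above are only a proposal, so the function names and sample data below are assumptions for illustration.

```python
from collections import Counter


def prefix_counts(values, prefix_chars):
    """Count values by their first `prefix_chars` characters, lowercased."""
    return Counter(v[:prefix_chars].lower() for v in values)


def term_counts(values, prefix):
    """Count full values that start with the chosen (lowercase) prefix."""
    return Counter(v for v in values if v.lower().startswith(prefix))


# Made-up author values, one per matching document.
authors = ["Adams", "Adams", "Asimov", "Atwood", "Banks", "Brin"]

# Level 1: initials. Level 2: two-character prefixes under "a".
# Level 3: full names under "ad".
print(prefix_counts(authors, 1))
print(prefix_counts([a for a in authors if a.lower().startswith("a")], 2))
print(term_counts(authors, "ad"))
```

Each level only looks at values that survived the previous choice, which is what keeps the full list of names from being tested up front.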

: I think what I am starting to understand is that coming from what we
: have (a rdbms based metadata gathering system), I need to rethink my
: process. I've spent so much time training myself to think in terms of
: how to make things fast in mysql that I need to re-open my mind :)

yeah ... inverted indexes are a completely different beast from relational
databases.  i would definitely suggest reading up on lucene, and getting
to know the basics -- keeping in mind that anything that can be done in
lucene can be done in a solr plugin, solr just makes it easier and gives
you "wicked cool" caching :)



-Hoss


Re: metadata about result sets?

Posted by Corey Tisdale <co...@shopperschoice.com>.
Interesting... I have been looking at lucene in my spare time at work
for all of 3 days now, so I have to apologize for my lack of
understanding when it comes to how it works specifically :)  We have
a terrible internal search that we are looking to replace, and the
only thing it does well is help you refine a terrible resultset with
faceted metadata. The way that we build the metadata list is
post-indexing of product, we would actually build a bit sequence that
corresponds to all possible key/value combos for each product and
associate each variation with the product. Then when someone refines
with the price range, let's say, we just look for the one that
matches. It gets a bit crazy, but it is the only way we could get the
speed down for millions of documents in the index. I did read that
email from cnet the other day, but it didn't really register what
they were talking about until I saw your metadata group xml example
doc here.

For the schema, I just meant the document format. the file is called  
schema.xml. I haven't tried it, but it looks like you can change that  
to affect the way solr works without actually affecting the way  
lucene handles it. Is that wrong? I guess it doesn't really matter,  
since it looks like your indexable groups make more sense from a  
maintainability standpoint (less redundant data).

As for 'scanning the resultset', I can see how I was a little shy on
the details. Sorry about that. I meant look through the results to
see what facets apply to the resultset. So if my company sells books
and power tools, when someone searches for 'the little engine who
could', once we know there are no power tools in the result set, I
don't show the refinement facets for power tool metadata (like
wattage or battery operated or blade size). For a big group of
diverse data, we could potentially have several hundred group names,
and it seems like it might be redundant to search for 300 metadata
types when we know that only 5 apply to the resultset. If, however,
speed is not impacted noticeably by searching for metadata that does
not exist, then we don't need to worry about this. I am not familiar
enough with lucene's performance to know which would be more optimal.

  In your example file, how does the name facet know to display only
the names that start with whatever initial was selected?   Would that
be built in by modifying our result set first (by applying the
author:a from the initial metadata group) then letting it gather all
author names on the new set? That seems the easiest way to me, but I
don't know how that would affect speed with lucene.

I think what I am starting to understand is that coming from what we  
have (a rdbms based metadata gathering system), I need to rethink my  
process. I've spent so much time training myself to think in terms of
how to make things fast in mysql that I need to re-open my mind :)

-Corey

On Mar 10, 2006, at 12:44 AM, Chris Hostetter wrote:

>
> : I like the idea of the wiki page; I think I will attempt to set one
> : up after this email, but I wanted to see if I could do a little bit
> : better job of fleshing out how pulling metadata out might work (in my
>
> I finally got a chance to look at your ideas.
>
> first off: as far as i know, there aren't any special edit permissions
> necessary to modify the TaskList ... if the edit link wasn't showing up
> for you after you logged in, it might just be that the page was cached,
> try a force-reload.
>
> Okay, on to the topic at hand..
>
> : We add suggestable metadata as part of the product schema, so we
> : could have something like
>
> There's a difference between the index schema, and the "xml schema/dtd"
> for adding documents.  You seem to be suggesting a change to the xml used
> when adding documents to indicate whether a field should be suggestable or
> not, but that syntax is tied directly to the underlying lucene API for
> Documents/Fields -- where would the suggestable/preceding info be stored?
>
> : Once we reindex, we do a search for 'legal' again and our book is in
> : it. Based on our index,  we can scan the resultset and see that the
> : results have three suggestable fields, two of which do not require a
> : preceding field.
>
> I'm not sure what you mean by "scan the result" to get the
> suggestable (and their values) ... can you elaborate?
>
>
> I'm not sure if you read the thread yonik mentioned earlier about how we
> do this at CNET, but the way we store info about which fields we want to
> have facets on (and what those facets should be in the case of range
> queries and such) is to put "metadata documents" into the index.  for a
> single user request, you pull out the metadata document, then use the info
> contained in it to determine facets to search on and intersect with the
> main result.
>
> the format of the metadata docs we use is very custom, but perhaps a
> similar, generalized approach could be implemented?
>
> The plugin could dictate a specific XML format indicating the behavior to
> drive the facets using either of the following mechanisms (more could be
> added as needed)...
>   * make group FF of all indexed values of field F
>   * make group G using queries x, y, and z with labels a, b, and c
> ...users could index one or more metadata documents, containing the XML
> info in any stored field they want defined in the schema -- when
> configuring the plugin, they'd specify the field in the solrconfig.xml.
> at query time, they specify two queries: one to restrict the main results,
> and one to identify the metadata doc they want to use (if it's always the
> same one, a default could be configured in solrconfig as well)
>
> an example of what i mean about XML stored in a field of the metadata
> doc...
>
>    <facets>
>      <group id="price" label="Price">
>        <facet id="0-20"  label="Under $20">price:[0 TO 20]</facet>
>        <facet id="21-40" label="$21 - $40">price:[21 TO 40]</facet>
>        <facet id="41-60" label="$41 - $60">price:[41 TO 60]</facet>
>      </group>
>      <group id="initial" label="Author">
>        <facet id="a" label="A">author:a*</facet>
>        ...
>      </group>
>      <group id="name" label="Author" depends="initial">
>        <facet use-terms-field="author" />
>      </group>
>      ...
>    </facets>
>
>
> -Hoss
>


Re: metadata about result sets?

Posted by Chris Hostetter <ho...@fucit.org>.
: I like the idea of the wiki page; I think I will attempt to set one
: up after this email, but I wanted to see if I could do a little bit
: better job of fleshing out how pulling metadata out might work (in my

I finally got a chance to look at your ideas.

first off: as far as i know, there aren't any special edit permissions
necessary to modify the TaskList ... if the edit link wasn't showing up
for you after you logged in, it might just be that the page was cached,
try a force-reload.

Okay, on to the topic at hand..

: We add suggestable metadata as part of the product schema, so we
: could have something like

There's a difference between the index schema, and the "xml schema/dtd"
for adding documents.  You seem to be suggesting a change to the xml used
when adding documents to indicate whether a field should be suggestable or
not, but that syntax is tied directly to the underlying lucene API for
Documents/Fields -- where would the suggestable/preceding info be stored?

: Once we reindex, we do a search for 'legal' again and our book is in
: it. Based on our index,  we can scan the resultset and see that the
: results have three suggestable fields, two of which do not require a
: preceding field.

I'm not sure what you mean by "scan the result" to get the
suggestable (and their values) ... can you elaborate?


I'm not sure if you read the thread yonik mentioned earlier about how we
do this at CNET, but the way we store info about which fields we want to
have facets on (and what those facets should be in the case of range
queries and such) is to put "metadata documents" into the index.  for a
single user request, you pull out the metadata document, then use the info
contained in it to determine facets to search on and intersect with the
main result.

the format of the metadata docs we use is very custom, but perhaps a
similar, generalized approach could be implemented?

The plugin could dictate a specific XML format indicating the behavior to
drive the facets using either of the following mechanisms (more could be
added as needed)...
  * make group FF of all indexed values of field F
  * make group G using queries x, y, and z with labels a, b, and c
...users could index one or more metadata documents, containing the XML
info in any stored field they want defined in the schema -- when
configuring the plugin, they'd specify the field in the solrconfig.xml.
at query time, they specify two queries: one to restrict the main results,
and one to identify the metadata doc they want to use (if it's always the
same one, a default could be configured in solrconfig as well)

an example of what i mean about XML stored in a field of the metadata
doc...

   <facets>
     <group id="price" label="Price">
       <facet id="0-20"  label="Under $20">price:[0 TO 20]</facet>
       <facet id="21-40" label="$21 - $40">price:[21 TO 40]</facet>
       <facet id="41-60" label="$41 - $60">price:[41 TO 60]</facet>
     </group>
     <group id="initial" label="Author">
       <facet id="a" label="A">author:a*</facet>
       ...
     </group>
     <group id="name" label="Author" depends="initial">
       <facet use-terms-field="author" />
     </group>
     ...
   </facets>
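For anyone wanting to play with the format, here is a rough sketch of how a plugin might read such a metadata document and decide which groups to offer. The XML format itself is only a proposal from this thread; the trimmed-down sample, the `active_groups` helper, and its return shape are all assumptions, and Python's standard `xml.etree.ElementTree` stands in for whatever parser a real handler would use.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the proposed format, with the elided parts left out.
FACETS_XML = """
<facets>
  <group id="price" label="Price">
    <facet id="0-20" label="Under $20">price:[0 TO 20]</facet>
    <facet id="21-40" label="$21 - $40">price:[21 TO 40]</facet>
  </group>
  <group id="initial" label="Author">
    <facet id="a" label="A">author:a*</facet>
  </group>
  <group id="name" label="Author" depends="initial">
    <facet use-terms-field="author" />
  </group>
</facets>
"""


def active_groups(xml_text, selected_group_ids):
    """Return {group_id: [facet specs]} for groups whose dependency is met."""
    groups = {}
    for group in ET.fromstring(xml_text).findall("group"):
        depends = group.get("depends")
        if depends is not None and depends not in selected_group_ids:
            continue  # hide full author names until an initial is picked
        specs = []
        for facet in group.findall("facet"):
            field = facet.get("use-terms-field")
            if field is not None:
                specs.append(("terms", field))  # one facet per indexed term
            else:
                specs.append(("query", facet.text.strip()))
        groups[group.get("id")] = specs
    return groups


print(list(active_groups(FACETS_XML, set())))        # before any choice
print(list(active_groups(FACETS_XML, {"initial"})))  # after picking a letter
```

Each ("query", ...) spec would then be run as a filter and intersected with the main result, while a ("terms", field) spec would expand into one facet per term in that field.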


-Hoss


Re: metadata about result sets?

Posted by Corey Tisdale <co...@shopperschoice.com>.
I like the idea of the wiki page; I think I will attempt to set one  
up after this email, but I wanted to see if I could do a little bit  
better job of fleshing out how pulling metadata out might work (in my  
mind):

Scenario 1:
------------------------------------------------------------------------
We add suggestable metadata as part of the product schema, so we  
could have something like
<add>
	<doc>
		<field name="id">12345</field>
		<field name="author_first_letter" suggestable="1">S</field>
		<field name="author" suggestable="1" preceding="author_first_letter">Smith</field>
		<field name="price_range" suggestable="1">0-25</field>
		<field name="price">14.23</field>
		<field name="name">Some Crazy Legal Book</field>
	</doc>
</add>

Once we reindex, we do a search for 'legal' again and our book is in  
it. Based on our index,  we can scan the resultset and see that the  
results have three suggestable fields, two of which do not require a  
preceding field.

We return a list of the attributes and their distinct values for  
author_first_letter and price_range.

If author_first_letter is specified as S in the next step of the
search, then we offer the author attribute and the distinct values of  
author where author_first_letter = S

If price_range 0-25 is selected, then we don't show price in the  
metadata list, since it is not suggestable.
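The rules in this scenario can be sketched in a few lines. Everything here is hypothetical: `suggestable` and `preceding` are the attributes proposed above, not existing Solr features, and the extra sample books are made up so the output has more than one value per field.

```python
# Hypothetical per-field configuration mirroring the proposed attributes.
FIELD_CONFIG = {
    "author_first_letter": {"suggestable": True, "preceding": None},
    "author": {"suggestable": True, "preceding": "author_first_letter"},
    "price_range": {"suggestable": True, "preceding": None},
    "price": {"suggestable": False, "preceding": None},
}


def suggestions(results, chosen):
    """Distinct values per suggestable field, given already chosen filters."""
    matching = [doc for doc in results
                if all(doc.get(f) == v for f, v in chosen.items())]
    out = {}
    for field, cfg in FIELD_CONFIG.items():
        if not cfg["suggestable"] or field in chosen:
            continue
        if cfg["preceding"] is not None and cfg["preceding"] not in chosen:
            continue  # e.g. no author list until a first letter is chosen
        out[field] = sorted({doc[field] for doc in matching if field in doc})
    return out


books = [
    {"id": "12345", "author_first_letter": "S", "author": "Smith",
     "price_range": "0-25", "price": 14.23},
    {"id": "12346", "author_first_letter": "S", "author": "Stone",
     "price_range": "26-50", "price": 30.00},
    {"id": "12347", "author_first_letter": "J", "author": "Jones",
     "price_range": "0-25", "price": 9.99},
]

print(suggestions(books, {}))
print(suggestions(books, {"author_first_letter": "S"}))
```

Note that this walks every matching document, which is the "scan the resultset" step; on a large index that cost is the part worth measuring.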

I don't think this would be really much more difficult than trying to  
extract attributes, and it gives people an easier time adopting this
project without shutting down the default functionality that the TODO
list suggests right now.

and now to start that wiki... Thoughts?

corey


Re: metadata about result sets?

Posted by Chris Hostetter <ho...@fucit.org>.
: layer on top of that though, right? Lets say someone says 'Oh yeah, I
: mean thriller' then you want to show author. You don't want to list
: 5000 author names right off the bat -- you want to let them pick a
: letter like 'author last name starts with a', then display author

At a certain point, logic like that really becomes domain specific: you
have to know in advance that the author field is something that you want to
have two levels of filtering (first by initial, then by full name),
meanwhile the category field should have the full list of facets
displayed, and the price field should have facets displayed based on
ranges -- but what should the ranges be?

In my experience, there are really two types of faceting: data driven
(pick/display facets based entirely on the terms found in the index) and
metadata driven (pick the facets based on configuration supplied by an
index maintainer who knows about the nature of the data)

data driven faceting is something i think the standard request handler can
support really easily just by having a parameter that lets you specify a
list of fields to display faceted counts for relative to the current search.
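A minimal sketch of the data driven case, assuming the results are available as dictionaries of stored field values. The `field_facet_counts` helper and the sample documents are made up; no such request parameter existed when this was written.

```python
from collections import Counter


def field_facet_counts(result_docs, facet_fields):
    """Count each distinct value of each requested field in the results."""
    counts = {field: Counter() for field in facet_fields}
    for doc in result_docs:
        for field in facet_fields:
            value = doc.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts


# Made-up results for a search on "legal".
legal_books = [
    {"category": "thriller", "format": "paperback"},
    {"category": "thriller", "format": "hardcover"},
    {"category": "test prep", "format": "paperback"},
]
print(field_facet_counts(legal_books, ["category", "format"]))
```

The facets come entirely from whatever values show up in the results, so nobody has to configure them ahead of time.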

simple metadata driven faceting can be done using the second approach
described in the todo: letting the query client supply a list of
facet queries as part of the request -- but that really doesn't seem like
it can scale up to the level of thousands of authors whose names start
with "a"
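The facet query approach might look roughly like this. Each lambda is a stand-in for a parsed query such as price:[0 TO 20]; the labels, the `query_facet_counts` helper, and the sample prices are assumptions for illustration.

```python
def query_facet_counts(result_docs, facet_queries):
    """Count the results matching each client-supplied facet query."""
    return {
        label: sum(1 for doc in result_docs if matches(doc))
        for label, matches in facet_queries.items()
    }


# Hypothetical facet queries supplied with the request.
price_facets = {
    "Under $20": lambda doc: 0 <= doc["price"] <= 20,
    "$21 - $40": lambda doc: 21 <= doc["price"] <= 40,
    "$41 - $60": lambda doc: 41 <= doc["price"] <= 60,
}

priced_books = [{"price": 14.23}, {"price": 19.99}, {"price": 35.00}]
print(query_facet_counts(priced_books, price_facets))
```

Three price ranges are easy for a client to send along with every request; a separate query per author name is where this stops being practical.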

That's the point where writing a custom query handler that knows about the
rules you want to use to drive your facets really makes sense.

: letter step and just show the author names. What if the todo went
: something like

I'll be honest, i'm not really following what you mean here, but feel free
to elaborate and add your ideas to the wiki ... i would suggest making a
new "ComplexFacetingBrainstorming" wiki page with your ideas and link to
it from the Todo page.


-Hoss


Re: metadata about result sets?

Posted by Corey Tisdale <co...@shopperschoice.com>.
Yeah, that is exactly what I mean. There should possibly be one more  
layer on top of that though, right? Let's say someone says 'Oh yeah, I
mean thriller' then you want to show author. You don't want to list  
5000 author names right off the bat -- you want to let them pick a  
letter like 'author last name starts with a', then display author  
names under that, so there is a sort of rule based precedent on when  
to return which value list. The flip side is, if there are only 5  
authors under thriller category, maybe we should skip the pick a  
letter step and just show the author names. What if the todo went  
something like

Simple faceted browsing (grouping) support in the standard query handler

	get resultset and possible refine fields
	limit refine fields to those appropriate based on result set and precedence of fields
	group by field (provide counts for each distinct value in that field)
	group by (query1, query2, query3, query4, query5)
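One way to picture the "skip the pick a letter step" rule is a simple threshold on the number of distinct values. The `author_refinement` helper and the cutoff of 5 are assumptions taken from the example in this message, not anything Solr provides.

```python
def author_refinement(authors, max_names=5):
    """Pick which refinement to offer for the authors in a result set.

    Returns ("names", [...]) when the distinct names fit under the
    threshold, otherwise ("initials", [...]) so the user picks a letter first.
    """
    distinct = sorted({a for a in authors if a})
    if len(distinct) <= max_names:
        return ("names", distinct)
    return ("initials", sorted({a[0].upper() for a in distinct}))


few = ["King", "Koontz", "King"]
many = ["Adams", "Asimov", "Banks", "Brin", "Clarke", "Dick"]
print(author_refinement(few))
print(author_refinement(many))
```

A rule like this needs the distinct value count before it can decide, so it would sit between fetching the result set and building the refine list.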

Does that make sense? I can be really bad at explaining sometimes.


Corey

PS - This is gonna rock once it starts maturing.



On Mar 7, 2006, at 6:16 PM, Yonik Seeley wrote:

> On 3/7/06, Corey Tisdale <co...@shopperschoice.com> wrote:
>> Is there any way to get metadata about a search result off of
>> this bad dog?
>
> Hi Corey,
>
> From the standard request handler, you can currently only get the
> stored fields of documents that matched a query, and the relevancy
> score of each document.
>
>> I am trying to find  a good way to search through
>> several million items, and I think that being able to aggregate data
>> about the results would help people refine the search, so if they
>> search for "legal" I could show a result set with 22000 books but
>> then also ask if they meant thrillers or test prep or career
>> planning, etc.
>
> This type of functionality has been implemented with Solr (faceted
> browsing), but it's not out-of-the-box functionality - you need to
> implement a request handler to calculate and return this extra data.
>
> See here for an explanation on how CNET did it:
> http://www.mail-archive.com/java-user@lucene.apache.org/msg02645.html
>
> Simple built-in faceted browsing support is planned though... from
> http://wiki.apache.org/solr/TaskList
>
> * Simple faceted browsing (grouping) support in the standard query handler
>     * group by field (provide counts for each distinct value in that field)
>     * group by (query1, query2, query3, query4, query5)
>
> Would that meet your needs?
>
> -Yonik


Re: metadata about result sets?

Posted by Yonik Seeley <ys...@gmail.com>.
On 3/7/06, Corey Tisdale <co...@shopperschoice.com> wrote:
> Is there any way to get metadata about a search result off of
> this bad dog?

Hi Corey,

From the standard request handler, you can currently only get the
stored fields of documents that matched a query, and the relevancy
score of each document.

> I am trying to find  a good way to search through
> several million items, and I think that being able to aggregate data
> about the results would help people refine the search, so if they
> search for "legal" I could show a result set with 22000 books but
> then also ask if they meant thrillers or test prep or career
> planning, etc.

This type of functionality has been implemented with Solr (faceted
browsing), but it's not out-of-the-box functionality - you need to
implement a request handler to calculate and return this extra data.

See here for an explanation on how CNET did it:
http://www.mail-archive.com/java-user@lucene.apache.org/msg02645.html

Simple built-in faceted browsing support is planned though... from
http://wiki.apache.org/solr/TaskList

* Simple faceted browsing (grouping) support in the standard query handler
    * group by field (provide counts for each distinct value in that field)
    * group by (query1, query2, query3, query4, query5)

Would that meet your needs?

-Yonik