Posted to solr-user@lucene.apache.org by Daniel Einspanjer <de...@gmail.com> on 2007/03/28 23:28:47 UTC

Best approach for indexing and querying against a multivalue name field like directors or actors?

I'm rather new to Solr and somewhat rusty on what little I learned on
Lucene a few years back.

I've got some documents I want to index that have multiple name fields
such as directors or actors. I'm wanting to index them such that
querying for "Jane Doe" would have a higher score for "Jane M. Doe"
than for "John Doe", but I need to make sure that "Jane Doe" wouldn't
match a document with two directors, "Jane Smith" and "John Doe" at
all.

If anyone has done something like this and could suggest some of the
solr filters that might be useful to me, I'd greatly appreciate it.

Daniel

Re: Best approach for indexing and querying against a multivalue name field like directors or actors?

Posted by Daniel Einspanjer <de...@gmail.com>.

I'm sorry, I said something confusing there.
Let me try that last case again.

If you have three documents with a multivalue field named director
(represented here by ; separator)
1. "Jane M. Doe"
2. "Jane Smith"; "John Doe"
3. "John Doe"

And the user searched for director:"Jane Doe", I would ideally like 1.
to have the highest score and 2 and 3 to have nearly equal scores.
The experiments I've done so far have given 2. a score higher than 3.
because the terms Jane and Doe were found in document 2. even though
they were in separate instances of the multivalue field.
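
A toy illustration of why document 2 can outrank document 3, assuming a much simpler model than Lucene's real scoring: it only counts how many query terms appear anywhere in the field, ignoring the boundaries between values. The function name `matched_terms` is hypothetical.

```python
def matched_terms(values, query):
    """Count query terms found anywhere in the field, ignoring value boundaries."""
    tokens = {token for value in values for token in value.lower().split()}
    return sum(1 for term in query.lower().split() if term in tokens)


if __name__ == "__main__":
    docs = {
        1: ["Jane M. Doe"],
        2: ["Jane Smith", "John Doe"],
        3: ["John Doe"],
    }
    for doc_id, directors in docs.items():
        print(doc_id, matched_terms(directors, "Jane Doe"))
```

In this model documents 1 and 2 both contain two of the query terms and document 3 only one, which is the behaviour described above.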

I hope this makes my question easier to understand rather than harder. :)

Thanks,
Daniel

On 3/28/07, Daniel Einspanjer <de...@gmail.com> wrote:
> <snip> but I need to make sure that "Jane Doe" wouldn't
> match a document with two directors, "Jane Smith" and "John Doe" at
> all.

Re: Snippets of indexed text

Posted by Thierry Collogne <th...@gmail.com>.

Glad I could help you.

On 02/04/07, Pierre-Yves LANDRON <pl...@hotmail.com> wrote:
>
> >And this is the part for the highlighted text :
> >
> ><lst name="highlighting">
> ><lst name="col_36863_NL">
> >  <arr name="content">
> >    <str></str>
> >  </arr>
> ></lst>
> ></lst>
> >
>
> Yes, it works just fine, and it's great. :)
>
> Thanks Thierry: you were right, I wasn't looking at the right tag in the
> response.
> (My problem with the facet parameters is still unresolved, but I will work on
> that later.)
>
> The more I use Solr, the more I'm glad I chose this way of working
> with Lucene.
>
> Kind Regards...
> P-Yves Landron

Re: Snippets of indexed text

Posted by Pierre-Yves LANDRON <pl...@hotmail.com>.

>And this is the part for the highlighted text :
>
><lst name="highlighting">
><lst name="col_36863_NL">
>  <arr name="content">
>    <str></str>
>  </arr>
></lst>
></lst>
>

Yes, it works just fine, and it's great. :)

Thanks Thierry: you were right, I wasn't looking at the right tag in the
response.
(My problem with the facet parameters is still unresolved, but I will work on
that later.)

The more I use Solr, the more I'm glad I chose this way of working
with Lucene.

Kind Regards...
P-Yves Landron

Re: Snippets of indexed text

Posted by Thierry Collogne <th...@gmail.com>.

I can't see anything wrong. But perhaps you are looking at the wrong part of
the response. It is the same with facets.
You need to look further down in the XML response. Here I asked Solr to
highlight the field "content" and I used a facet called type.

This is a sample of an xml response in our application

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">5</int>
 <lst name="params">
  <str name="rows">10</str>
  <str name="start">0</str>

  <str name="facet">true</str>
  <str name="q">stamp AND site:3</str>
  <str name="version">2.2</str>
  <str name="hl.fl">content</str>
  <str name="facet.field">type</str>
  <str name="indent">on</str>

  <str name="hl">true</str>
 </lst>
</lst>
<result name="response" numFound="1" start="0">
 <doc>
  <str name="id">col_36863_NL</str>
  <str name="authorisation">ALL</str>
  <str name="content"></str>
  <str name="type">HR</str>
 </doc>
</result>
<lst name="facet_counts">
 <lst name="facet_queries"/>
 <lst name="facet_fields">
  <lst name="type">
    <int name="HR">1</int>
  </lst>
 </lst>
</lst>
<lst name="highlighting">
 <lst name="col_36863_NL">
  <arr name="content">
    <str></str>
  </arr>
 </lst>
</lst>
</response>


If you look near the end, you see the following for facets:

<lst name="facet_counts">
 <lst name="facet_queries"/>
 <lst name="facet_fields">
  <lst name="type">
    <int name="HR">1</int>
  </lst>
 </lst>
</lst>


And this is the part for the highlighted text :

<lst name="highlighting">
 <lst name="col_36863_NL">
  <arr name="content">
    <str></str>
  </arr>
 </lst>
</lst>
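
As a rough sketch of how a client could pick these two sections out of the response, here is a small Python example using only the standard library. The function names are hypothetical, the sample is a trimmed-down copy of the response above, and the snippet text inside `<str>` is made up for illustration (the archived response shows it empty).

```python
import xml.etree.ElementTree as ET


def extract_highlighting(root):
    """Map document id -> field name -> list of highlighted snippets."""
    section = root.find("lst[@name='highlighting']")
    if section is None:
        return {}
    return {
        doc.get("name"): {
            arr.get("name"): ["".join(s.itertext()) for s in arr.findall("str")]
            for arr in doc.findall("arr")
        }
        for doc in section.findall("lst")
    }


def extract_facet_counts(root):
    """Map facet field name -> facet value -> count."""
    fields = root.find("lst[@name='facet_counts']/lst[@name='facet_fields']")
    if fields is None:
        return {}
    return {
        field.get("name"): {i.get("name"): int(i.text) for i in field.findall("int")}
        for field in fields.findall("lst")
    }


# Trimmed-down response; the snippet text is an invented placeholder.
SAMPLE = """<response>
<lst name="facet_counts">
 <lst name="facet_queries"/>
 <lst name="facet_fields">
  <lst name="type">
    <int name="HR">1</int>
  </lst>
 </lst>
</lst>
<lst name="highlighting">
 <lst name="col_36863_NL">
  <arr name="content">
    <str>a rare stamp from the collection</str>
  </arr>
 </lst>
</lst>
</response>"""

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(extract_facet_counts(root))
    print(extract_highlighting(root))
```

The point is only that both sections are siblings of the `<result>` element, keyed by their `name` attribute, so a client has to look for them there rather than inside each `<doc>`.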

I hope this helps a bit. By the way, if you are using java, it may be good
to check out the java client here

   http://issues.apache.org/jira/browse/SOLR-20

There is a comment with some code that I added. This code can be added to
the java client to support highlighting.

If you need any more help, just post and I will try to help further.


On 30/03/07, Pierre-Yves LANDRON <pl...@hotmail.com> wrote:
>
> hello,
>
> thanks for the info; it's exactly what I need. I can't manage to make it
> work, though. it's strange because I have the same problem with facets: it
> seems that some options are not taken into account...
>
> for example, here is my request to solr:
>
> q=%28%28titre:moulin%29+OR+%28texte:moulin%29+OR+%28sujet:moulin%29+OR+%28desc:moulin%29%29&version=2.1&start=0&rows=12&fl=*+score&qt=standard&hl=true&hl.fl=texte,desc&hl.snippets=3&hl.fragsize=150
>
> and an extract of the response is :
> <doc>
> <float name="score">0.0151801035</float>
> <str name="PID">bml:8071</str>
> <str name="texte">
> Les Grands Moulins
> Le chemin de la Bouteille n'est pas, comme son nom semblerait l'indiquer, le
> chemin préféré des ivrognes. En l'occurrence, c'est plutôt le chemin des
> Boulangers ou mieux encore (... cut by me; in fact the whole field is
> returned)
> </str>
> <str name="thumb">http://10.208.0.215:8080/fedora/get/bml:8071/Thumb</str>
> <str name="type">page</str>
> </doc>
>
> obviously the hl parameters haven't been taken into account. I've got the
> same problem with the facet.mincount parameter; facets work fine, but this
> parameter is not taken into account for some reason...
>
> did I do something wrong?
>
> thanks,
> kind regards,
> p-y
>
>
>
>
> >From: "Thierry Collogne" <th...@gmail.com>
> >Reply-To: solr-user@lucene.apache.org
> >To: solr-user@lucene.apache.org
> >Subject: Re: Snippets of indexed text
> >Date: Thu, 29 Mar 2007 08:56:51 +0200
> >
> >It is possible. You need to pass highlighting parameters. Look here :
> >
> >      http://wiki.apache.org/solr/HighlightingParameters
> >
> >Hope this helps.
> >

Re: Snippets of indexed text

Posted by Pierre-Yves LANDRON <pl...@hotmail.com>.

hello,

thanks for the info; it's exactly what I need. I can't manage to make it
work, though. it's strange because I have the same problem with facets: it
seems that some options are not taken into account...

for example, here is my request to solr:
q=%28%28titre:moulin%29+OR+%28texte:moulin%29+OR+%28sujet:moulin%29+OR+%28desc:moulin%29%29&version=2.1&start=0&rows=12&fl=*+score&qt=standard&hl=true&hl.fl=texte,desc&hl.snippets=3&hl.fragsize=150

and an extract of the response is :
<doc>
<float name="score">0.0151801035</float>
<str name="PID">bml:8071</str>
<str name="texte">
Les Grands Moulins
Le chemin de la Bouteille n'est pas, comme son nom semblerait l'indiquer, le 
chemin préféré des ivrognes. En l'occurrence, c'est plutôt le chemin des 
Boulangers ou mieux encore (... cut by me; in fact the whole field is
returned)
</str>
<str name="thumb">http://10.208.0.215:8080/fedora/get/bml:8071/Thumb</str>
<str name="type">page</str>
</doc>

obviously the hl parameters haven't been taken into account. I've got the
same problem with the facet.mincount parameter; facets work fine, but this
parameter is not taken into account for some reason...

did I do something wrong?

thanks,
kind regards,
p-y




>From: "Thierry Collogne" <th...@gmail.com>
>Reply-To: solr-user@lucene.apache.org
>To: solr-user@lucene.apache.org
>Subject: Re: Snippets of indexed text
>Date: Thu, 29 Mar 2007 08:56:51 +0200
>
>It is possible. You need to pass highlighting parameters. Look here :
>
>      http://wiki.apache.org/solr/HighlightingParameters
>
>Hope this helps.
>

Re: Snippets of indexed text

Posted by Thierry Collogne <th...@gmail.com>.

It is possible. You need to pass highlighting parameters. Look here :

      http://wiki.apache.org/solr/HighlightingParameters

Hope this helps.
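
For illustration, a small Python sketch of building such a request. `hl`, `hl.fl`, `hl.snippets` and `hl.fragsize` are standard Solr highlighting parameters; the endpoint URL, the function name and the default values are assumptions and need adjusting to your own setup.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; adjust host, port and path to your own Solr install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_highlight_url(query, fields, snippets=3, fragsize=150):
    """Build a select URL that asks Solr for highlighted snippets."""
    params = [
        ("q", query),
        ("hl", "true"),                   # switch highlighting on
        ("hl.fl", ",".join(fields)),      # fields to highlight
        ("hl.snippets", str(snippets)),   # max snippets per field
        ("hl.fragsize", str(fragsize)),   # approximate snippet size in characters
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_highlight_url("texte:moulin", ["texte", "desc"]))
```

Note that the snippets come back in a separate "highlighting" section of the response, not inside the documents themselves.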

On 29/03/07, Pierre-Yves LANDRON <pl...@hotmail.com> wrote:
>
> Hello everybody !
>
> I am wondering if there is a way to get some relevant snippets (the searched
> terms in context) of indexed text with a Solr response to a query, instead
> of just the entire indexed field? (More broadly, what are the possibilities
> for letting Solr format the answer (highlight terms, etc.)?)
>
> Thanks,
> Kind regards,
> P-Y Landron

Snippets of indexed text

Posted by Pierre-Yves LANDRON <pl...@hotmail.com>.

Hello everybody !

I am wondering if there is a way to get some relevant snippets (the searched
terms in context) of indexed text with a Solr response to a query, instead
of just the entire indexed field? (More broadly, what are the possibilities
for letting Solr format the answer (highlight terms, etc.)?)

Thanks,
Kind regards,
P-Y Landron

Re: Best approach for indexing and querying against a multivalue name field like directors or actors?

Posted by Chris Hostetter <ho...@fucit.org>.

you'll want to look into the positionIncrementGap attribute that can be
specified when defining an Analyzer for your field type ... it defines the
"logical" gap between the tokens of successive values in a multi-value
field, so if you use a whitespace tokenizer and add the names "Jane Smith"
and "John Doe" you'll get the tokens "Jane", "Smith", ... "John", "Doe" with
a big gap between Smith and John .. so now you can do phrase queries, and as
long as the slop on your phrase queries is less than the gap you used you
don't have to worry about false matches on "Jane Doe"
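
A minimal sketch of the idea, in Python rather than Solr itself. It models token positions for a multi-valued field and an in-order, two-term sloppy phrase check. The function names, the gap of 100 and the slop of 3 are assumptions chosen for illustration, and real Lucene phrase matching and scoring are more involved than this.

```python
def index_positions(values, gap):
    """Return (token, position) pairs for a multi-valued field.

    Simplified model: tokens are split on whitespace and lowercased, and
    the first token of every value after the first is pushed `gap` extra
    positions away from the last token of the previous value.
    """
    positions = []
    pos = 0
    for i, value in enumerate(values):
        if i > 0:
            pos += gap
        for token in value.lower().split():
            positions.append((token, pos))
            pos += 1
    return positions


def phrase_matches(positions, first, second, slop):
    """True if `second` follows `first` with at most `slop` positions between."""
    firsts = [p for t, p in positions if t == first]
    seconds = [p for t, p in positions if t == second]
    return any(0 <= b - a - 1 <= slop for a in firsts for b in seconds)


if __name__ == "__main__":
    docs = {
        1: ["Jane M. Doe"],
        2: ["Jane Smith", "John Doe"],
        3: ["John Doe"],
    }
    for gap in (0, 100):
        for doc_id, names in docs.items():
            hit = phrase_matches(index_positions(names, gap), "jane", "doe", slop=3)
            print(f"gap={gap} doc={doc_id} match={hit}")
```

With a gap of 0 the sloppy phrase matches document 2 across the two names; with a gap larger than the slop it matches only document 1.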



: Date: Wed, 28 Mar 2007 17:28:47 -0400
: From: Daniel Einspanjer <de...@gmail.com>
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Best approach for indexing and querying against a multivalue
:     name field like directors or actors?
:
: I'm rather new to Solr and somewhat rusty on what little I learned on
: Lucene a few years back.
:
: I've got some documents I want to index that have multiple name fields
: such as directors or actors. I'm wanting to index them such that
: querying for "Jane Doe" would have a higher score for "Jane M. Doe"
: than for "John Doe", but I need to make sure that "Jane Doe" wouldn't
: match a document with two directors, "Jane Smith" and "John Doe" at
: all.
:
: If anyone has done something like this and could suggest some of the
: solr filters that might be useful to me, I'd greatly appreciate it.
:
: Daniel
:



-Hoss