Posted to java-user@lucene.apache.org by Felipe <fe...@goshme.com> on 2010/01/12 21:44:42 UTC

Re: Is there any difference in a document between one added field with a number of terms and a field added a number of times ?

You could give the artist field a bigger boost than the alias field:
    field.setBoost(artistBoost);


2010/1/12 Paul Taylor <pa...@fastmail.fm>

> I've been doing some analysis with Luke (BTW, it doesn't work with
> StandardAnalyzer since the Version field was introduced) and discovered a
> problem with field length boosting.
>
> I have a document that represents a recording artist (e.g. Madonna, The
> Beatles, etc.). It contains an artist field and an alias field; the alias
> field contains other names that the artist may be known by, so there can be
> multiple aliases for an artist.
>
> PseudoCode:
> (
> doc.addField(ArtistIndexField.ARTIST, rs.getString("name"));
> for (String alias : aliases.get(artistId)) {
>     doc.addField(ArtistIndexField.ALIAS, alias);
> }
> )
>
> I'm finding that when I search for an artist by the alias field, and the
> value matches an alias in two different documents, the document with the
> fewest aliases gets the best score, because the boost of the alias is
> split between the aliases on the other doc. If I index the field with
> ANALYZED_NO_NORMS then both documents return the same score.
>
> The trouble is I don't want to disable norms, because I want a match on a
> single field containing fewer terms to score better than one containing
> more terms.
>
> Full example:
>
>
> http://musicbrainz.org/search/textsearch.html?query=minihamuzu&type=artist&limit=25&adv=on&handlearguments=1
> returns two results. The second result only has a score of 8 because it has
> more aliases than the first result, even though the alias it matched on was
> an exact single-term match.
> http://musicbrainz.org/show/artist/aliases.html?artistid=174327
>
> but if I remove norms then the following query (which is currently working)
>
>
> http://musicbrainz.org/search/textsearch.html?query=%22the+beatles%22&type=artist&limit=25&adv=on&handlearguments=1
>
> would stop working, in that searching for 'The beatles' would no longer
> score artist 'The Beatles' higher than 'The Beatles revival Band'.
>
> So isn't there any way to recognise that repeated calls to addField() are
> not creating a single field with many terms, but many fields with few terms?
>
> thanks Paul
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
Felipe Lobo
www.jusbrasil.com.br

Re: Is there any difference in a document between one added field with a number of terms and a field added a number of times ?

Posted by Paul Taylor <pa...@fastmail.fm>.
So not much help here (I wonder if it's because I posted 3 questions in
one day), but I've made some progress in my understanding.

I understand there is only one norm per field, and I think Lucene does not
differentiate between adding the same field a number of times and adding
multiple pieces of text to the same field. But I've discovered
getPositionIncrementGap() to separate my multiple adds of the same field
within a doc, and I was wondering if there was a way I could use the
position gap to get DefaultSimilarity.lengthNorm() to be called with only
the number of tokens within one field instance passed to it, rather than
the total number of terms within the field as a whole.
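
To make the effect concrete, here is a minimal sketch in plain Java (no
Lucene dependency) of how a single per-field length norm behaves when
several values are added under the same field name. It assumes the classic
DefaultSimilarity formula, lengthNorm = 1 / sqrt(numTerms), and ignores
boosts and the lossy one-byte norm encoding. The class name, the helper
methods, the whitespace tokenizer and the sample aliases are all
hypothetical, chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: none of these names are Lucene API.
public class AliasLengthNormSketch {

    // Same formula as the classic DefaultSimilarity length norm.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Naive whitespace tokenizer standing in for a real analyzer (assumption).
    static int countTokens(String value) {
        String trimmed = value.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // One norm for the whole field: tokens of every added value are summed.
    static float sharedNorm(List<String> values) {
        int total = 0;
        for (String value : values) {
            total += countTokens(value);
        }
        return lengthNorm(total);
    }

    // The norm asked about in this thread: computed from one value only.
    static float perValueNorm(String value) {
        return lengthNorm(countTokens(value));
    }

    public static void main(String[] args) {
        // Made-up alias lists for two artist documents.
        List<String> docA = Arrays.asList("minihamuzu");
        List<String> docB = Arrays.asList("minihamuzu", "mini hams", "the mini hamsters");

        System.out.println("doc A shared norm:    " + sharedNorm(docA));
        System.out.println("doc B shared norm:    " + sharedNorm(docB));
        System.out.println("doc B per-value norm: " + perValueNorm(docB.get(0)));
    }
}
```

Document B ends up with the lower shared norm even though the matched alias
is a single term in both documents, which is the behaviour described above.
The per-value figure only shows what a per-instance norm would look like; the
sketch does not claim that Lucene can compute it.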

Paul

Paul Taylor wrote:
> Thanks Felipe, but you are missing the point. Artist really doesn't
> come into it; my problem is confined to the alias field. Forget about
> artist, it's just included to give the complete scenario.
>
> Paul



Re: Is there any difference in a document between one added field with a number of terms and a field added a number of times ?

Posted by Paul Taylor <pa...@fastmail.fm>.
Thanks Felipe, but you are missing the point. Artist really doesn't come
into it; my problem is confined to the alias field. Forget about artist,
it's just included to give the complete scenario.

Paul


