You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Felipe <fe...@goshme.com> on 2010/01/12 21:44:42 UTC
Re: Is there any difference in a document between one added field
with a number of terms and a field added a number of times ?
You could change the boost of the field artist to be bigger than the field
alias.
field.setBoost(artistBoost);
2010/1/12 Paul Taylor <pa...@fastmail.fm>
> Been doing some analysis with Luke (BTW doesnt work with StandardAnalyzer
> since Version field introduced) and discovered a problem with field lenghth
> boosting for me.
>
> I have a document that represents a recording artist (i.e Madonna, The
> Beatles ectera) it contains an artist and an alias field, the alias field
> contains other names that the artist is maybe known as, and so there can be
> multiple aliases for an artist.
>
> PseudoCode:
> (
> doc.addField(ArtistIndexField.ARTIST, rs.getString("name"));
> for (String alias : aliases.get(artistId)) {
> doc.addField(ArtistIndexField.ALIAS, alias);
> }
> )
>
> Im finding that when I search by for the artist by the alias field if the
> value matches an alias in two different documents the document with the
> least number of aliases get the best score because the boost of the alias is
> split between the aliases on the other doc, if I ANALYSED_NO_NORMS then both
> documents return the same score.
>
> The trouble is I don't want to disable norms because I want a match on a
> single field containing less terms to score better than one with more
> scores.
>
> Full example:
>
>
> http://musicbrainz.org/search/textsearch.html?query=minihamuzu&type=artist&limit=25&adv=on&handlearguments=1
> return two results , the second result only has score of 8 because it more
> aliases than the first result, even the alias it matched on was an exact
> single term match.
> http://musicbrainz.org/show/artist/aliases.html?artistid=174327
>
> but if I remove norms then the following query (which is currently working)
>
>
> http://musicbrainz.org/search/textsearch.html?query=%22the+beatles%22&type=artist&limit=25&adv=on&handlearguments=1
>
> would stop working, in that searching for 'The beatles' would no longer
> score rate artist 'The Beatles' better than 'The Beatles revival Band'
>
> So isn't there any way to recognise that repeated calls to addField() is
> not creating a single field with many terms,but many fields with few terms.
>
> thanks Paul
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
--
Felipe Lobo
www.jusbrasil.com.br
Re: Is there any difference in a document between one added field
with a number of terms and a field added a number of times ?
Posted by Paul Taylor <pa...@fastmail.fm>.
So not much help here, (I wonder if its because I posted 3 questions in
one day) but Ive made some progress in my understaning.
I understand there is only one norm per field and I think Lucene does no
differentiating between adding the same field a number of times and
adding mutiple text to the same field. But Ive discovered
getPositionIncrementGap() to seperate my multiple adds of the same field
within a doc and I was wondering if they was a way I could use the
position gap to get DefaultSimailrity.lengthNorm() to be called with
only the number of tokens within one field passed to it rather than the
complete terms within the field as a whole.
Paul
Paul Taylor wrote:
> Thanks Felipe, but you are missing the point Artist really doesnt
> come into it, my problem is confined to the alias field, forget about
> artist its just detailed to give the complete scenario
>
> Paul
>
> Felipe wrote:
>> You could change the boost of the field artist to be bigger than the
>> field alias.
>> field.setBoost(artistBoost);
>>
>>
>> 2010/1/12 Paul Taylor <paul_t100@fastmail.fm
>> <ma...@fastmail.fm>>
>>
>> Been doing some analysis with Luke (BTW doesnt work with
>> StandardAnalyzer since Version field introduced) and discovered a
>> problem with field lenghth boosting for me.
>>
>> I have a document that represents a recording artist (i.e Madonna,
>> The Beatles ectera) it contains an artist and an alias field, the
>> alias field contains other names that the artist is maybe known
>> as, and so there can be multiple aliases for an artist.
>>
>> PseudoCode:
>> (
>> doc.addField(ArtistIndexField.ARTIST, rs.getString("name"));
>> for (String alias : aliases.get(artistId)) {
>> doc.addField(ArtistIndexField.ALIAS, alias);
>> }
>> )
>>
>> Im finding that when I search by for the artist by the alias field
>> if the value matches an alias in two different documents the
>> document with the least number of aliases get the best score
>> because the boost of the alias is split between the aliases on the
>> other doc, if I ANALYSED_NO_NORMS then both documents return the
>> same score.
>>
>> The trouble is I don't want to disable norms because I want a
>> match on a single field containing less terms to score better than
>> one with more scores.
>>
>> Full example:
>>
>>
>> http://musicbrainz.org/search/textsearch.html?query=minihamuzu&type=artist&limit=25&adv=on&handlearguments=1
>>
>>
>> <http://musicbrainz.org/search/textsearch.html?query=minihamuzu&type=artist&limit=25&adv=on&handlearguments=1>
>>
>> return two results , the second result only has score of 8 because
>> it more aliases than the first result, even the alias it matched
>> on was an exact single term match.
>> http://musicbrainz.org/show/artist/aliases.html?artistid=174327
>>
>> but if I remove norms then the following query (which is currently
>> working)
>>
>>
>> http://musicbrainz.org/search/textsearch.html?query=%22the+beatles%22&type=artist&limit=25&adv=on&handlearguments=1
>>
>>
>> <http://musicbrainz.org/search/textsearch.html?query=%22the+beatles%22&type=artist&limit=25&adv=on&handlearguments=1>
>>
>>
>> would stop working, in that searching for 'The beatles' would no
>> longer score rate artist 'The Beatles' better than 'The Beatles
>> revival Band'
>>
>> So isn't there any way to recognise that repeated calls to
>> addField() is not creating a single field with many terms,but many
>> fields with few terms.
>>
>> thanks Paul
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> <ma...@lucene.apache.org>
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> <ma...@lucene.apache.org>
>>
>>
>>
>>
>> --
>> Felipe Lobo
>> www.jusbrasil.com.br <http://www.jusbrasil.com.br>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Is there any difference in a document between one added field
with a number of terms and a field added a number of times ?
Posted by Paul Taylor <pa...@fastmail.fm>.
Thanks Felipe, but you are missing the point Artist really doesnt come
into it, my problem is confined to the alias field, forget about artist
its just detailed to give the complete scenario
Paul
Felipe wrote:
> You could change the boost of the field artist to be bigger than the
> field alias.
> field.setBoost(artistBoost);
>
>
> 2010/1/12 Paul Taylor <paul_t100@fastmail.fm
> <ma...@fastmail.fm>>
>
> Been doing some analysis with Luke (BTW doesnt work with
> StandardAnalyzer since Version field introduced) and discovered a
> problem with field lenghth boosting for me.
>
> I have a document that represents a recording artist (i.e Madonna,
> The Beatles ectera) it contains an artist and an alias field, the
> alias field contains other names that the artist is maybe known
> as, and so there can be multiple aliases for an artist.
>
> PseudoCode:
> (
> doc.addField(ArtistIndexField.ARTIST, rs.getString("name"));
> for (String alias : aliases.get(artistId)) {
> doc.addField(ArtistIndexField.ALIAS, alias);
> }
> )
>
> Im finding that when I search by for the artist by the alias field
> if the value matches an alias in two different documents the
> document with the least number of aliases get the best score
> because the boost of the alias is split between the aliases on the
> other doc, if I ANALYSED_NO_NORMS then both documents return the
> same score.
>
> The trouble is I don't want to disable norms because I want a
> match on a single field containing less terms to score better than
> one with more scores.
>
> Full example:
>
> http://musicbrainz.org/search/textsearch.html?query=minihamuzu&type=artist&limit=25&adv=on&handlearguments=1
> <http://musicbrainz.org/search/textsearch.html?query=minihamuzu&type=artist&limit=25&adv=on&handlearguments=1>
> return two results , the second result only has score of 8 because
> it more aliases than the first result, even the alias it matched
> on was an exact single term match.
> http://musicbrainz.org/show/artist/aliases.html?artistid=174327
>
> but if I remove norms then the following query (which is currently
> working)
>
> http://musicbrainz.org/search/textsearch.html?query=%22the+beatles%22&type=artist&limit=25&adv=on&handlearguments=1
> <http://musicbrainz.org/search/textsearch.html?query=%22the+beatles%22&type=artist&limit=25&adv=on&handlearguments=1>
>
> would stop working, in that searching for 'The beatles' would no
> longer score rate artist 'The Beatles' better than 'The Beatles
> revival Band'
>
> So isn't there any way to recognise that repeated calls to
> addField() is not creating a single field with many terms,but many
> fields with few terms.
>
> thanks Paul
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> <ma...@lucene.apache.org>
> For additional commands, e-mail: java-user-help@lucene.apache.org
> <ma...@lucene.apache.org>
>
>
>
>
> --
> Felipe Lobo
> www.jusbrasil.com.br <http://www.jusbrasil.com.br>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org