You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by mmoser <mm...@balihoo.com> on 2006/12/15 22:17:05 UTC

Sorting by a percentage On a field

So, I am still new to Lucene, so please take this into consideration when
reading this. Up until now, a novice like myself has been able to finagle
Lucene into doing what we want. But now we have a problem that I have been
searching for the answer to. We allow users to profile our products with a
predetermined profile attribute id. We then want to take all the users
profiles on a product and take a particular number of times that this
particular profile attribute id has been chosen and come out with a
percentage for it. This is no problem. Where the problem comes into play is
that we want the user to be able to search for products that match that
particular profile attribute id. We want the higher percentages to come up
on top. To add to the complexity, we want to be able to allow for the user
to select multiple profile attribute ids and still have a combination of the
score to come up higher. Keep in mind, we would like to somehow keep these
in one field, because we are trying to use the same algorithm for something
that could potentially become very large. Any suggestions. The more detail,
the better. 

Example:

Product A
Attribute ID = 2    Percentage Chosen = 50%
Attribute ID = 5    Percentage Chosen = 30%
Attribute ID = 3    Percentage Chosen = 10%
Attribute ID = 13    Percentage Chosen = 30%

Product B
Attribute ID = 1    Percentage Chosen = 50%
Attribute ID = 2    Percentage Chosen = 20%
Attribute ID = 3    Percentage Chosen = 75%

So if a user selected the attributes that correspond to 2 and 3, then
Product B should show up before Product A because it has a combined score of
95% and A has a combined score of 40%.

Thanks for any help.


-- 
View this message in context: http://www.nabble.com/Sorting-by-a-percentage-On-a-field-tf2829501.html#a7899335
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Sorting by a percentage On a field

Posted by Doron Cohen <DO...@il.ibm.com>.
I think there's a better way - using SpanFirstQuery.

This query scores higher a term positioned closer to the beginning of a
document.

Assume your resolution is 1%, i.e. 1 to 100, create the percentage field
with max token position 99.  Generate the tokens such that all IDs with
percentage 100 would have token position of 0, all those with percentage 99
would have token position of 1, all those with percentage n would have
token position of 100-n. I never done it myself but I know it is doable,
see Token.setPositionIncrement().

So for document A below the percentage field would look like this: "2[50]
5[20] 13[0] 3[20]" - read this as tokenWord[tokenPositionIcrement]. So this
would places "2" in position 50, "5" and "13" in position 70, and "3" in
position 90.

The query for "2" and "3" would be composed of two SpanFirstQuery objects -
one with "2" and one with "3" - composed in a BooleanQuery I guess.

You can change the effect of the distance on the score by providing your
Similarity class and overriding sloppyFreq().

Regards,
Doron

mmoser <mm...@balihoo.com> wrote on 15/12/2006 21:09:27:
>
> Thanks for your reply. We have currently thought about both of these
> approaches, so that definitely makes me feel better about things. The
first
> approach you had mentioned, we had thought about our tagging problem and
how
> to make a product tag come to the top, but again, with a lot of tags, the
> data becomes extremely large. Even if we were to take some magnitude and
> apply it, it could possibly be huge due to an infinite number of tags.
>
> The second approach we had considered doing a version almost like that. I
> had considered that if we have a query such as the one you are talking
about
> mixed with other fields, then the query would be extremely big.
Especially
> if the user was looking for 15 attribute ids which would be 15 * 10 = 150
> different ORed term searches alone  if we were only searching every 10
> percent.
>
> I definitely thank you for this input and would love to hear if anyone
has
> any other approaches. If not, we will probably attempt one of these
routes
> and see which one is a bigger storage / performance hit.
>
> Mike
>
>
>
> Doron Cohen wrote:
> >
> > I think the right solution for this would use "payloads", where extra
data
> > can be added for each index token. However Lucene currently does not
> > support this. Without this I can think of two options, each with its
own
> > disadvantage:
> >
> > 1) more tokens at indexing time - decide on the resolution of the
> > percentage - say it is 5% - and add more tokens of the same. For
example,
> > the attributes field for product A in your example would look like: "2
2 2
> > 2 2 2 2 2 2 2 5 5 5 5 5 5 3 3 13 13 13 13 13 13".
> >
> > 2) more tokens at search time - at indexing, include the percentage in
the
> > token. So for product A you would have: "2x50 5x30 3x10 13x30". At
search
> > time, expand the query accordingly. So the query for attribute 5 would
be
> > expanded to: "5x5^5 5x10^10 5x15^15 5x20^20 ... 5x95^95 5x100^100".
> >
> > The first approach would enlarge the index, so if you have lots of data
> > that could eventually be a problem.
> > The second approach would end up with a large query, so, again, if you
> > have
> > lots of data that could eventually be a problem with search time.
> >
> > Also, depending how strict you want the scoring to be, you may want to
> > omit
> > norms for this field.
> >
> > Hope this helps,
> > Doron
> >
> > mmoser <mm...@balihoo.com> wrote on 15/12/2006 13:17:05:
> >>
> >> So, I am still new to Lucene, so please take this into consideration
when
> >> reading this. Up until now, a novice like myself has been able to
finagle
> >> Lucene into doing what we want. But now we have a problem that I have
> > been
> >> searching for the answer to. We allow users to profile our products
with
> > a
> >> predetermined profile attribute id. We then want to take all the users
> >> profiles on a product and take a particular number of times that this
> >> particular profile attribute id has been chosen and come out with a
> >> percentage for it. This is no problem. Where the problem comes into
play
> > is
> >> that we want the user to be able to search for products that match
that
> >> particular profile attribute id. We want the higher percentages to
come
> > up
> >> on top. To add to the complexity, we want to be able to allow for the
> > user
> >> to select multiple profile attribute ids and still have a combination
of
> > the
> >> score to come up higher. Keep in mind, we would like to somehow keep
> > these
> >> in one field, because we are trying to use the same algorithm for
> > something
> >> that could potentially become very large. Any suggestions. The more
> > detail,
> >> the better.
> >>
> >> Example:
> >>
> >> Product A
> >> Attribute ID = 2    Percentage Chosen = 50%
> >> Attribute ID = 5    Percentage Chosen = 30%
> >> Attribute ID = 3    Percentage Chosen = 10%
> >> Attribute ID = 13    Percentage Chosen = 30%
> >>
> >> Product B
> >> Attribute ID = 1    Percentage Chosen = 50%
> >> Attribute ID = 2    Percentage Chosen = 20%
> >> Attribute ID = 3    Percentage Chosen = 75%
> >>
> >> So if a user selected the attributes that correspond to 2 and 3, then
> >> Product B should show up before Product A because it has a combined
score
> > of
> >> 95% and A has a combined score of 40%.
> >>
> >> Thanks for any help.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
>
> --
> View this message in context: http://www.nabble.com/Sorting-by-a-
> percentage-On-a-field-tf2829501.html#a7903444
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Sorting by a percentage On a field

Posted by mmoser <mm...@balihoo.com>.
Thanks for your reply. We have currently thought about both of these
approaches, so that definitely makes me feel better about things. The first
approach you had mentioned, we had thought about our tagging problem and how
to make a product tag come to the top, but again, with a lot of tags, the
data becomes extremely large. Even if we were to take some magnitude and
apply it, it could possibly be huge due to an infinite number of tags.

The second approach we had considered doing a version almost like that. I
had considered that if we have a query such as the one you are talking about
mixed with other fields, then the query would be extremely big. Especially
if the user was looking for 15 attribute ids which would be 15 * 10 = 150
different ORed term searches alone  if we were only searching every 10
percent. 

I definitely thank you for this input and would love to hear if anyone has
any other approaches. If not, we will probably attempt one of these routes
and see which one is a bigger storage / performance hit.

Mike



Doron Cohen wrote:
>  
> I think the right solution for this would use "payloads", where extra data
> can be added for each index token. However Lucene currently does not
> support this. Without this I can think of two options, each with its own
> disadvantage:
> 
> 1) more tokens at indexing time - decide on the resolution of the
> percentage - say it is 5% - and add more tokens of the same. For example,
> the attributes field for product A in your example would look like: "2 2 2
> 2 2 2 2 2 2 2 5 5 5 5 5 5 3 3 13 13 13 13 13 13".
> 
> 2) more tokens at search time - at indexing, include the percentage in the
> token. So for product A you would have: "2x50 5x30 3x10 13x30". At search
> time, expand the query accordingly. So the query for attribute 5 would be
> expanded to: "5x5^5 5x10^10 5x15^15 5x20^20 ... 5x95^95 5x100^100".
> 
> The first approach would enlarge the index, so if you have lots of data
> that could eventually be a problem.
> The second approach would end up with a large query, so, again, if you
> have
> lots of data that could eventually be a problem with search time.
> 
> Also, depending how strict you want the scoring to be, you may want to
> omit
> norms for this field.
> 
> Hope this helps,
> Doron
> 
> mmoser <mm...@balihoo.com> wrote on 15/12/2006 13:17:05:
>>
>> So, I am still new to Lucene, so please take this into consideration when
>> reading this. Up until now, a novice like myself has been able to finagle
>> Lucene into doing what we want. But now we have a problem that I have
> been
>> searching for the answer to. We allow users to profile our products with
> a
>> predetermined profile attribute id. We then want to take all the users
>> profiles on a product and take a particular number of times that this
>> particular profile attribute id has been chosen and come out with a
>> percentage for it. This is no problem. Where the problem comes into play
> is
>> that we want the user to be able to search for products that match that
>> particular profile attribute id. We want the higher percentages to come
> up
>> on top. To add to the complexity, we want to be able to allow for the
> user
>> to select multiple profile attribute ids and still have a combination of
> the
>> score to come up higher. Keep in mind, we would like to somehow keep
> these
>> in one field, because we are trying to use the same algorithm for
> something
>> that could potentially become very large. Any suggestions. The more
> detail,
>> the better.
>>
>> Example:
>>
>> Product A
>> Attribute ID = 2    Percentage Chosen = 50%
>> Attribute ID = 5    Percentage Chosen = 30%
>> Attribute ID = 3    Percentage Chosen = 10%
>> Attribute ID = 13    Percentage Chosen = 30%
>>
>> Product B
>> Attribute ID = 1    Percentage Chosen = 50%
>> Attribute ID = 2    Percentage Chosen = 20%
>> Attribute ID = 3    Percentage Chosen = 75%
>>
>> So if a user selected the attributes that correspond to 2 and 3, then
>> Product B should show up before Product A because it has a combined score
> of
>> 95% and A has a combined score of 40%.
>>
>> Thanks for any help.
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Sorting-by-a-percentage-On-a-field-tf2829501.html#a7903444
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Sorting by a percentage On a field

Posted by Doron Cohen <DO...@il.ibm.com>.
I think the right solution for this would use "payloads", where extra data
can be added for each index token. However Lucene currently does not
support this. Without this I can think of two options, each with its own
disadvantage:

1) more tokens at indexing time - decide on the resolution of the
percentage - say it is 5% - and add more tokens of the same. For example,
the attributes field for product A in your example would look like: "2 2 2
2 2 2 2 2 2 2 5 5 5 5 5 5 3 3 13 13 13 13 13 13".

2) more tokens at search time - at indexing, include the percentage in the
token. So for product A you would have: "2x50 5x30 3x10 13x30". At search
time, expand the query accordingly. So the query for attribute 5 would be
expanded to: "5x5^5 5x10^10 5x15^15 5x20^20 ... 5x95^95 5x100^100".

The first approach would enlarge the index, so if you have lots of data
that could eventually be a problem.
The second approach would end up with a large query, so, again, if you have
lots of data that could eventually be a problem with search time.

Also, depending how strict you want the scoring to be, you may want to omit
norms for this field.

Hope this helps,
Doron

mmoser <mm...@balihoo.com> wrote on 15/12/2006 13:17:05:
>
> So, I am still new to Lucene, so please take this into consideration when
> reading this. Up until now, a novice like myself has been able to finagle
> Lucene into doing what we want. But now we have a problem that I have
been
> searching for the answer to. We allow users to profile our products with
a
> predetermined profile attribute id. We then want to take all the users
> profiles on a product and take a particular number of times that this
> particular profile attribute id has been chosen and come out with a
> percentage for it. This is no problem. Where the problem comes into play
is
> that we want the user to be able to search for products that match that
> particular profile attribute id. We want the higher percentages to come
up
> on top. To add to the complexity, we want to be able to allow for the
user
> to select multiple profile attribute ids and still have a combination of
the
> score to come up higher. Keep in mind, we would like to somehow keep
these
> in one field, because we are trying to use the same algorithm for
something
> that could potentially become very large. Any suggestions. The more
detail,
> the better.
>
> Example:
>
> Product A
> Attribute ID = 2    Percentage Chosen = 50%
> Attribute ID = 5    Percentage Chosen = 30%
> Attribute ID = 3    Percentage Chosen = 10%
> Attribute ID = 13    Percentage Chosen = 30%
>
> Product B
> Attribute ID = 1    Percentage Chosen = 50%
> Attribute ID = 2    Percentage Chosen = 20%
> Attribute ID = 3    Percentage Chosen = 75%
>
> So if a user selected the attributes that correspond to 2 and 3, then
> Product B should show up before Product A because it has a combined score
of
> 95% and A has a combined score of 40%.
>
> Thanks for any help.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org