Posted to solr-user@lucene.apache.org by Guangwei Yuan <gu...@gmail.com> on 2007/09/28 10:00:26 UTC
Color search
Hi,
We're running an e-commerce site that provides product search. We've been
able to extract colors from product images, and we think it'd be cool and
useful to search products by color. A product image can have up to 5 colors
(from a color space of about 100 colors), so we can implement it easily with
Solr's facet search (thanks all who've developed Solr).
The problem arises when we try to sort the results by color relevancy.
What's different from a normal facet search is that colors are weighted. For
example, a black dress can be 70% black, 20% gray, and 10% brown. A
search query "color:black" should return results in which the black dress
ranks higher than other products with a lower percentage of black.
My question is: how do I configure and index the color field so that products
with a higher percentage of color X rank higher for the query "color:X"?
Thanks for your help!
- Guangwei
Re: Color search
Posted by Matthew Runo <mr...@zappos.com>.
This discussion is incredibly interesting to me! We solved this
simply by indexing the color names, and faceting on that. Not a very
elegant solution, to be sure - but it works. If people search for a
"green running shoe" they get -green- running shoes.
I would be very very interested in having a color picker ajax app
which then went out and found the products with colors most like the
one you chose.
+--------------------------------------------------------+
| Matthew Runo
| Zappos Development
| mruno@zappos.com
| 702-943-7833
+--------------------------------------------------------+
Re: Color search
Posted by Chris Hostetter <ho...@fucit.org>.
: I used the same field name (color), not 10 different names (c0 - c9).
ah .. got it. then what you are probably seeing is because of length
normalization (shorter fields get a higher norm). if you set omitNorms="true"
on the field then it shouldn't matter.
(i don't know why i suggested a separate field for each 10% block ... i'm
sure i had a good reason but i can't think of it now)
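To see why length normalization could produce the ordering described, here is a rough sketch. It assumes the formulas of Lucene's classic DefaultSimilarity (tf = sqrt(freq), lengthNorm = 1/sqrt(number of terms in the field)) and ignores idf, boosts, and the lossy one-byte norm encoding. The class and method names are made up for illustration.

```java
final class NormSketch {
    /** Approximate per-field score contribution: tf * lengthNorm. */
    static double score(int termFreq, int fieldLength, boolean omitNorms) {
        double tf = Math.sqrt(termFreq);
        // With omitNorms="true" the norm is treated as 1.0 for every document.
        double lengthNorm = omitNorms ? 1.0 : 1.0 / Math.sqrt(fieldLength);
        return tf * lengthNorm;
    }

    public static void main(String[] args) {
        // Product A: 5 x #000000 + 2 x #999999 = 7 color terms.
        // Product B: 5 x #000000 + 5 terms of other colors = 10 color terms.
        System.out.println("A with norms:    " + score(5, 7, false));
        System.out.println("B with norms:    " + score(5, 10, false));
        System.out.println("A without norms: " + score(5, 7, true));
        System.out.println("B without norms: " + score(5, 10, true));
    }
}
```

Under these assumptions A outranks B for color:#000000 while norms are enabled, and the two tie once norms are omitted, which would also remove the need for dummy values.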
-Hoss
Re: Color search
Posted by Guangwei Yuan <gu...@gmail.com>.
>
> can you explain exactly how you are indexing the data and what your
> query looks like?
>
I used the same field name (color), not 10 different names (c0 - c9).
So the index fields look like (50% #000000, 20% #999999):
color: #000000
color: #000000
color: #000000
color: #000000
color: #000000
color: #999999
color: #999999
The query for black dresses will be:
color:#000000
Re: Color search
Posted by Chris Hostetter <ho...@fucit.org>.
: extraction algorithm, etc.) So, for a product with 50% of #000000, and 20%
: of #999999, I'll have to fill the remaining three fields with some dummy
: values. Otherwise, Lucene seems to score it higher than products that also
: have 50% of #000000, but more than 20% of some other colors. Since I also
that doesn't really make sense to me ... your input is colors to search
for, and you query each of those colors against every field right? so if
i said i want grey and red dresses, you query for...
    +(c0:grey c1:grey c2:grey c3:grey c4:grey
      c5:grey c6:grey c7:grey c8:grey c9:grey)
    +(c0:red c1:red c2:red c3:red c4:red
      c5:red c6:red c7:red c8:red c9:red)
...right? a document that doesn't have any value in c6, c7 or c8
shouldn't score higher than any other documents ... if anything it should
score lower because of the coord factor.
can you explain exactly how you are indexing the data and what your
query looks like?
-Hoss
Re: Color search
Posted by Guangwei Yuan <gu...@gmail.com>.
Thanks for all the replies. I think creating 10 fields and filling one
field with a color's value for each 10% that color covers is a reasonable
approach, and easy to implement too. One problem, though, is that not all
products have colors totaling 100% (due to various reasons including our color
extraction algorithm, etc.) So, for a product with 50% of #000000 and 20%
of #999999, I'll have to fill the remaining three fields with some dummy
values. Otherwise, Lucene seems to score it higher than products that also
have 50% of #000000, but more than 20% of some other colors. Since I also
need a way to exclude the dummy value when faceting, is there a neater
solution?
I'll certainly look at the payload functionality, which is new to me :)
- Guangwei
Re: Color search
Posted by Mike Klaas <mi...@gmail.com>.
On 28-Sep-07, at 6:31 AM, Grant Ingersoll wrote:
> Another option would be to extend Solr (and donate back) to
> incorporate Lucene's payload functionality, in which case you could
> associate the percentile of the color as a payload and use the
> BoostingTermQuery... :-) If you're interested in this, a
> discussion on solr-dev is probably warranted to figure out the best
> way to do this.
For reference, here is a summary of the changes needed:
1. A payload analyzer. Here is an example that tokenizes strings of the form
<token>:<whatever>:<score> into <token> with payload <score>:
    /** Returns the next token in the stream, or null at EOS. */
    public final Token next() throws IOException {
      Token t = input.next();
      if (null == t)
        return null;
      String s = t.termText();
      if (s.indexOf(":") > -1) {
        String[] parts = s.split(":");
        assert parts.length == 3;
        String colour = parts[0];
        // the score is the third component of <token>:<whatever>:<score>
        int bits = Float.floatToIntBits(Float.parseFloat(parts[2]));
        // store the float's bits least-significant byte first
        byte[] buf = new byte[4];
        for (int shift = 0, i = 0; shift < 32; shift += 8, i++) {
          buf[i] = (byte) ((bits >> shift) & 0xff);
        }
        Token gen = new Token(colour, t.startOffset(), t.endOffset());
        gen.setPayload(new Payload(buf));
        t = gen;
      }
      return t;
    }
2. A payload deserializer. Add this method to your custom Similarity
class:
    public float scorePayload(byte[] payload, int offset, int length) {
      assert length == 4;
      int accum = ((payload[0 + offset] & 0xff))
                | ((payload[1 + offset] & 0xff) << 8)
                | ((payload[2 + offset] & 0xff) << 16)
                | ((payload[3 + offset] & 0xff) << 24);
      return Float.intBitsToFloat(accum);
    }
3. Add a relevant query clause. In a custom request handler, you
could have a parameter to add BoostingTermQueries:
    q = new BoostingTermQuery(new Term("colourPayload", colour));
    query.add(q, Occur.SHOULD);
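The byte layout shared by steps 1 and 2 can be checked in isolation, without Lucene on the classpath. This is only a sketch of the encoding (float bits, least-significant byte first). The class name PayloadCodec is made up for illustration and is not part of Lucene or Solr.

```java
final class PayloadCodec {
    /** Serializes a float score into 4 bytes, least-significant byte first. */
    static byte[] encode(float score) {
        int bits = Float.floatToIntBits(score);
        byte[] buf = new byte[4];
        for (int i = 0; i < 4; i++) {
            buf[i] = (byte) ((bits >> (8 * i)) & 0xff);
        }
        return buf;
    }

    /** Reads back a float written by encode(), starting at the given offset. */
    static float decode(byte[] payload, int offset) {
        int bits = (payload[offset] & 0xff)
                 | ((payload[offset + 1] & 0xff) << 8)
                 | ((payload[offset + 2] & 0xff) << 16)
                 | ((payload[offset + 3] & 0xff) << 24);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // A colour covering 70% of the image round-trips unchanged.
        System.out.println(decode(encode(0.7f), 0));
    }
}
```

The analyzer and the Similarity must agree on this layout. If one side changes byte order, the boost silently becomes garbage rather than failing.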
How to add this generically is an interesting question. There are
many possibilities, especially on the request handler and tokenizer
side of things. If there is a consensus on a sensible way of doing
this, I could contribute the bits of code that I have.
HTH,
-Mike
Re: Color search
Posted by Grant Ingersoll <gs...@apache.org>.
Another option would be to extend Solr (and donate back) to
incorporate Lucene's payload functionality, in which case you could
associate the percentile of the color as a payload and use the
BoostingTermQuery... :-) If you're interested in this, a discussion
on solr-dev is probably warranted to figure out the best way to do this.
-Grant
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Color search
Posted by Yonik Seeley <yo...@apache.org>.
If it were just a couple of colors, you could have a separate field
for each color and then index the percent in that field.
black:70
grey:20
and then you could use a function query to influence the score (or you
could sort by the color percent).
However, this doesn't scale well to a large index with a large number of colors.
Each field used like that will take up 4 bytes per document in the index.
so if you have 1M documents, that's 1Mdocs * 100colors * 4bytes = 400MB
Doable depending on your index size (use "int" or "float" and not
"sint" or "sfloat" type for this... it will be better on the memory).
If you needed to be better on the memory, you could encode all of the
colors into a single value (perhaps into a compact string... one
percentile per byte or something) and then have a custom function that
extracts the value for a particular color. (this involves some java
development)
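As a sketch of the packed encoding, assuming a fixed palette where each color has an integer id from 0 to 99: one byte per palette slot holds that color's percentage, so a document costs about 100 bytes instead of 100 fields * 4 bytes (roughly 100MB rather than 400MB for 1M documents, before any overhead). The class and method names are hypothetical, and wiring percentOf into a Solr function is not shown.

```java
final class PackedColors {
    /** Packs percentages (0..100) into one byte per palette slot; unused slots stay 0. */
    static byte[] pack(int paletteSize, int[] colorIds, int[] percents) {
        byte[] packed = new byte[paletteSize];
        for (int i = 0; i < colorIds.length; i++) {
            packed[colorIds[i]] = (byte) percents[i]; // 0..100 fits in a signed byte
        }
        return packed;
    }

    /** Returns the stored percentage for one color, or 0 if the product lacks it. */
    static int percentOf(byte[] packed, int colorId) {
        return packed[colorId];
    }

    public static void main(String[] args) {
        // Hypothetical ids: 0 = black, 17 = grey.
        byte[] dress = pack(100, new int[] {0, 17}, new int[] {70, 20});
        System.out.println(percentOf(dress, 0));  // percentage of black
        System.out.println(percentOf(dress, 42)); // a color the dress lacks
    }
}
```

Since a product has at most 5 colors, storing (id, percent) pairs would be smaller still, at the cost of a short scan per lookup.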
-Yonik
Re: Color search
Posted by Steven Rowe <sa...@syr.edu>.
Hi Renaud,
I think your method will produce strange results, probably in most
cases, e.g.
33% red   #FF0000 = #550000
33% green #00FF00 = #005500
33% blue  #0000FF = #000055
                  = #555555
Thus, a red, green and blue dress would score well against a search for
"medium gray". Not good.
Steve
RE: Color search
Posted by Renaud Waldura <re...@library.ucsf.edu>.
Here's another idea: encode each color mix as a single RGB value (it fits in
32 bits) and sort by distance from the requested color. Finding the closest
color is like finding the closest point in the color space, i.e. a distance search.
70% black #000000 = #000000
20% gray  #f0f0f0 = #303030
10% brown #8b4513 = #0e0702
                  = #3e3732
The distance would be:
sqrt( (r1 - r0)^2 + (g1 - g0)^2 + (b1 - b0)^2 )
Where r0g0b0 is the color the user asked for, and r1g1b1 is the composite
color of the item, calculated above.
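A minimal sketch of that calculation, assuming colors are passed as 0xRRGGBB ints and weights as whole percentages summing to at most 100. The class and method names are made up for illustration.

```java
final class ColorMix {
    /** Blends 0xRRGGBB colors by percentage weight, rounding each channel. */
    static int composite(int[] rgb, int[] percents) {
        int r = 0, g = 0, b = 0;
        for (int i = 0; i < rgb.length; i++) {
            r += ((rgb[i] >> 16) & 0xff) * percents[i];
            g += ((rgb[i] >> 8) & 0xff) * percents[i];
            b += (rgb[i] & 0xff) * percents[i];
        }
        // Divide by 100 with rounding; assumes percents sum to at most 100.
        r = (r + 50) / 100;
        g = (g + 50) / 100;
        b = (b + 50) / 100;
        return (r << 16) | (g << 8) | b;
    }

    /** Euclidean distance between two 0xRRGGBB colors. */
    static double distance(int c0, int c1) {
        int dr = ((c1 >> 16) & 0xff) - ((c0 >> 16) & 0xff);
        int dg = ((c1 >> 8) & 0xff) - ((c0 >> 8) & 0xff);
        int db = (c1 & 0xff) - (c0 & 0xff);
        return Math.sqrt(dr * dr + dg * dg + db * db);
    }

    public static void main(String[] args) {
        int dress = composite(new int[] {0x000000, 0xf0f0f0, 0x8b4513},
                              new int[] {70, 20, 10});
        System.out.println(Integer.toHexString(dress));
        System.out.println(distance(0x000000, dress)); // distance from pure black
    }
}
```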
--Renaud
Re: Color search
Posted by Steven Rowe <sa...@syr.edu>.
Hi Guangwei,
When you index your products, you could have a single color field, and
include duplicates of each color component proportional to its weight.
For example, if you decide to use 10% increments, for your black dress
with 70% of black, 20% of gray, 10% of brown, you would index the
following terms for the color field:
black black black black black black black
gray gray
brown
This works because Lucene natively interprets document term frequencies
as weights.
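A small sketch of building that field value at index time, assuming percentages are rounded to the nearest increment. The class and method names are hypothetical.

```java
final class ColorTerms {
    /** Repeats each color once per `step` percent of coverage, rounded to the nearest step. */
    static String expand(String[] colors, int[] percents, int step) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < colors.length; i++) {
            int copies = (percents[i] + step / 2) / step;
            for (int c = 0; c < copies; c++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(colors[i]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand(new String[] {"black", "gray", "brown"},
                                  new int[] {70, 20, 10}, 10));
    }
}
```

Note that with Lucene's classic DefaultSimilarity the term-frequency factor is sqrt(freq), so seven copies rank above two copies, but the score does not grow linearly with the percentage.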
Steve
Re: Color search
Posted by Chris Hostetter <ho...@fucit.org>.
: useful to search products by color. A product image can have up to 5 colors
: (from a color space of about 100 colors), so we can implement it easily with
: Solr's facet search (thanks all who've developed Solr).
:
: The problem arises when we try to sort the results by the color relevancy.
: What's different from a normal facet search is that colors are weighted. For
: example, a black dress can have 70% of black, 20% of gray, 10% of brown. A
if 5 is a hard max on the number of colors that you support, then you can
always use 5 separate fields to store the colors in order of "dominance"
and then query on those 5 fields with varying boosts...
color_1:black^10 color_2:black^7 color_3:black^4 color_4:black color_5:black^0.1
...something like this will lose the % granularity info that you have (so
a 60% black skirt and an 80% black dress would both score the same against
black since it's the dominant color)
alternately: i'm assuming your percentage data only has so much confidence
-- maybe on the order of 10%? you can have a separate field for each
"bucket" of color percentages and index the name of the color in the
corresponding bucket. with 10% granularity that's only 10 fields -- a 10
clause boolean query for the color is no big deal ... even going to 5%
would be trivial.
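One possible reading of the bucket idea, as a sketch: each percentage is rounded to the nearest bucket, the color name is indexed in that bucket's field, and the query boosts higher buckets more. The field names (color_p10 ... color_p100) and the linear boosts are assumptions for illustration, not something Solr provides.

```java
final class PercentBuckets {
    /** Field that a color with this percentage is indexed into, e.g. color_p70. */
    static String fieldFor(int percent, int step) {
        int bucket = ((percent + step / 2) / step) * step; // round to nearest bucket
        return "color_p" + Math.max(step, Math.min(100, bucket));
    }

    /** One optional clause per bucket, boosted in proportion to the bucket's percentage. */
    static String query(String color, int step) {
        StringBuilder sb = new StringBuilder();
        for (int bucket = step; bucket <= 100; bucket += step) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("color_p").append(bucket).append(':').append(color)
              .append('^').append(bucket / step);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fieldFor(68, 10));   // field for a 68% black dress
        System.out.println(query("black", 10)); // the 10 clause query
    }
}
```

Each color of a document lands in exactly one bucket field, so only one clause matches per color and the matching clause's boost decides the ranking.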
Incidentally: people interested in the general topic of color faceting at
a finer granularity than just color names may want to check out this
thread from last...
http://www.nabble.com/faceting-and-categorizing-on-color--tf1801106.html
-Hoss