Posted to solr-user@lucene.apache.org by Jorge Luis Betancourt González <jl...@uci.cu> on 2014/05/05 08:47:12 UTC

Re: Solr relevancy tuning

One good thing about kelvin is that it's more of a programmatic task, so you could execute the scripts after a few changes/deployments and get a general idea of whether the new changes have impacted the search experience; sure, the changing catalog is still a problem, but I like being able to execute a few commands and, presto, get it done. This could become a must-run test in the test suite of the app. I kind of do this already, but testing from the user interface, using the test library provided by Symfony2 (the framework I'm using) and its functional tests. It's not test-driven search relevancy "per se", but we ensure we don't mess up some basic queries we use to test the search feature.
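
As a rough illustration, such a must-run search check could look something
like this (a minimal sketch in Python, assuming a hypothetical /search
endpoint that returns JSON; in my case the real checks live in Symfony2
functional tests instead):

    import requests  # any HTTP client would do

    # Hypothetical "must never break" queries and a term we expect to see
    # in the title of at least one of the top results.
    MUST_PASS = {
        "network cable": "ethernet",
        "nike sportwatch": "sportwatch",
    }

    def test_basic_queries(base_url="http://localhost:8000/search"):
        for query, expected_term in MUST_PASS.items():
            resp = requests.get(base_url, params={"q": query, "rows": 10})
            resp.raise_for_status()
            titles = [doc["title"].lower() for doc in resp.json()["docs"]]
            assert any(expected_term in t for t in titles), \
                "query %r has no result with %r in the title" % (query, expected_term)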

----- Original Message -----
From: "Giovanni Bricconi" <gi...@banzai.it>
To: "solr-user" <so...@lucene.apache.org>
Cc: "Ahmet Arslan" <io...@yahoo.com>
Sent: Friday, April 11, 2014 5:15:56 AM
Subject: Re: Solr relevancy tuning

Hello Doug

I have just watched the Quepid demonstration video, and I strongly agree
with your introduction: it is very hard to involve marketing/business
people in repeated testing sessions, and spreadsheets or other kinds of
files are not the right tool to use.
Currently I'm quite alone in my tuning task, and having a visual approach
could be beneficial for me; you are giving me many good ideas!

I see that kelvin (my scripted tool) and Quepid follow the same path. In
Quepid someone quickly watches the results and applies colours to them; in
kelvin you enter one or more queries (network cable, ethernet cable) and
state that the results must contain ethernet in the title, or must come
from a list of product categories.
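
Roughly, a test case and its check boil down to something like this (a
simplified sketch, not the exact kelvin format; the Solr URL and the
title/category field names are just placeholders):

    import json
    import requests

    # A test in the spirit described above: for each query, every top
    # result must either contain a word in its title or belong to one of
    # a list of allowed product categories.
    tests = json.loads("""
    [
      {"queries": ["network cable", "ethernet cable"],
       "title_must_contain": "ethernet",
       "allowed_categories": ["cables", "networking"]}
    ]
    """)

    SOLR = "http://localhost:8983/solr/products/select"

    def failures(test):
        count = 0
        for q in test["queries"]:
            resp = requests.get(SOLR, params={"q": q, "defType": "edismax",
                                              "fl": "title,category",
                                              "rows": 10, "wt": "json"})
            for doc in resp.json()["response"]["docs"]:
                ok = (test["title_must_contain"] in doc.get("title", "").lower()
                      or doc.get("category") in test["allowed_categories"])
                count += 0 if ok else 1
        return count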

I also do diffs of results, before and after changes, to check what is
going on; but I have to do that in a very unix-scripted way.
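
For what it's worth, the core of that diff is tiny once you save the top-N
ids of each query before and after a change; roughly (a sketch in Python
rather than shell):

    def diff_results(before_ids, after_ids, top_n=10):
        """Compare the top-N document ids of one query before/after a change."""
        before, after = before_ids[:top_n], after_ids[:top_n]
        return {
            "dropped": [d for d in before if d not in after],
            "added": [d for d in after if d not in before],
            "same_order": before == after,
        }

    # e.g. diff_results(["SM-G900F", "GALAXY-S5"], ["GALAXY-S5", "SM-G900F"])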

Have you considered placing a counter of total red/bad results in
Quepid? I use this index to get a quick overview of a change's impact
across all queries. I actually repeat the tests in production from time to
time, and if I see the "kelvin temperature" rising (the number of errors
going up) I know I have to check what's going on, because new products may
be having a bad impact on the index.
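
The "temperature" itself is nothing fancy, just the sum of failed checks
over all test cases, compared between runs; roughly (a sketch, where
failures() is whatever per-test check you run, like the one above):

    def kelvin_temperature(tests, failures):
        """Total failed assertions across all test cases for one run."""
        return sum(failures(t) for t in tests)

    # Run once on the current config, then again after a change or a
    # catalog update; a rising number means some queries started returning
    # results that violate their test cases.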

I also keep counters of products with low-quality images or no images at
all, or with too-short listings; they are sometimes useful to better
understand what will happen if you change some bq/fq in the application.
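
Those counters are plain Solr counts; something along these lines works (a
sketch: the image_url and description_length fields are assumptions about
the schema):

    import requests

    SOLR = "http://localhost:8983/solr/products/select"

    def count(fq):
        """How many documents match a filter query, without fetching them."""
        resp = requests.get(SOLR, params={"q": "*:*", "fq": fq,
                                          "rows": 0, "wt": "json"})
        return resp.json()["response"]["numFound"]

    no_image = count("-image_url:[* TO *]")                 # no value in image_url
    short_listing = count("description_length:[0 TO 20]")   # suspiciously short text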

I also see that after changes in Quepid someone has to check the "gray"
results and assign them a colour; in kelvin's case the conditions can
sometimes do a bit of magic (new product names still contain SM-G900F) but
sometimes introduce false errors (the new product name contains only
Galaxy 5 and not the product code SM-G900F). So some checks are needed,
but with Quepid everybody can do the check, while with kelvin you have to
change some lines of a script, and not everybody is able/willing to do
that.

The idea of a static index is a good suggestion; I will try to have one in
the next round of search engine improvements.

Thank you Doug!




2014-04-09 17:48 GMT+02:00 Doug Turnbull <
dturnbull@opensourceconnections.com>:

> Hey Giovanni, nice to meet you.
>
> I'm the person that did the Test Driven Relevancy talk. We've got a product
> Quepid (http://quepid.com) that lets you gather good/bad results for
> queries and do a sort of test driven development against search relevancy.
> Sounds similar to your existing scripted approach. Have you considered
> keeping a static catalog for testing purposes? We had a project with a lot
> of updates and date-dependent relevancy. This lets you create some test
> scenarios against a static data set. However, one downside is you can't
> recreate problems in production in your test setup exactly-- you have to
> find a similar issue that reflects what you're seeing.
>
> Cheers,
> -Doug
>
>
> On Wed, Apr 9, 2014 at 10:42 AM, Giovanni Bricconi <
> giovanni.bricconi@banzai.it> wrote:
>
> > Thank you for the links.
> >
> > The book is really useful, I will definitely have to spend some time
> > reformatting the logs to access the number of results found, session
> > ids and much more.
> >
> > I'm also quite happy that my test cases produce similar results to the
> > precision reports shown at the beginning of the book.
> >
> > Giovanni
> >
> >
> > 2014-04-09 12:59 GMT+02:00 Ahmet Arslan <io...@yahoo.com>:
> >
> > > Hi Giovanni,
> > >
> > > Here are some relevant pointers :
> > >
> > > http://www.lucenerevolution.org/2013/Test-Driven-Relevancy-How-to-Work-with-Content-Experts-to-Optimize-and-Maintain-Search-Relevancy
> > >
> > > http://rosenfeldmedia.com/books/search-analytics/
> > >
> > > http://www.sematext.com/search-analytics/index.html
> > >
> > >
> > > Ahmet
> > >
> > >
> > > On Wednesday, April 9, 2014 12:17 PM, Giovanni Bricconi <
> > > giovanni.bricconi@banzai.it> wrote:
> > > I have been working on an e-commerce site for about a year now, and
> > > unfortunately I have no "information retrieval" background, so I am
> > > probably missing some important practices about relevance tuning and
> > > search engines.
> > > During this period I have had to fix many "bugs" about bad search
> > > results, which I have solved sometimes by tuning edismax weights,
> > > sometimes by creating ad hoc query filters or query boosts; but I am
> > > still not able to figure out what the correct process to improve
> > > search result relevance should be.
> > >
> > > These are the practices I am following; I would really appreciate any
> > > comments about them, and any hints about what practices you follow in
> > > your projects:
> > >
> > > - In order to have a measure of search quality I have written many
> > > test cases such as "if the user searches for <<nike sport watch>> the
> > > search results should display at least four <<tom tom>> products with
> > > the words <<nike>> and <<sportwatch>> in the title". I have written a
> > > tool that reads such tests from json files and applies them to my
> > > application, then counts the number of results that do not match the
> > > criteria stated in the test cases. (For those interested, this tool is
> > > available at https://github.com/gibri/kelvin but it is still quite a
> > > prototype.)
> > >
> > > - I use this count as a quality index. I have tried various times to
> > > change the edismax weights to lower the overall number of errors, or
> > > to add new filters/boosts to the application to try to decrease the
> > > error count.
> > >
> > > - The pro of this is that at least you have a number to look at, and
> > > a quick way of checking the impact of a modification.
> > >
> > > - The bad side is that you have to maintain the test cases: I now
> > > have about 800 tests and my product catalogue changes often, which
> > > means that some products exit the catalog and some test cases can't
> > > pass anymore.
> > >
> > > - I am populating the test cases using errors reported by users, and
> > > I feel that this is driving the test cases too much toward
> > > pathological cases. Moreover, I don't have many tests for cases that
> > > are working well now.
> > >
> > > I would like to use search logs as drivers to generate tests, but I
> > > feel I haven't picked the right path. Using top queries, manually
> > > reviewing results, and then writing tests is a slow process; moreover
> > > many top queries are ambiguous or are driven by site ads.
> > >
> > > Many, many queries are unique per user. How do you deal with these cases?
> > >
> > > How are you using your logs to find test cases to fix? Are you
> > > looking for queries where the user is not "opening" any returned
> > > results? Which KPI have you chosen to find queries that are not
> > > providing good results? And what are you using as a KPI for the whole
> > > search, besides the conversion rate?
> > >
> > > Can you suggest any other practices you are using in your projects?
> > >
> > > Thank you very much in advance
> > >
> > > Giovanni
> > >
> > >
> >
>
>
>
> --
> Doug Turnbull
> Search & Big Data Architect
> OpenSource Connections <http://o19s.com>
>

Re: Solr relevancy tuning

Posted by Doug Turnbull <dt...@opensourceconnections.com>.
I realize I never responded to this thread, shame on me!

Jorge/Giovanni: Kelvin looks pretty cool -- thanks for sharing it. When we
use Quepid, we sometimes do it at places with existing relevancy test
scripts like Kelvin. Quepid and test scripts tend to fill different niches.
Beyond pure testing, Quepid is also a GUI that helps you explain,
investigate and sandbox. Sometimes this is nice for fuzzier/more
qualitative judgments, especially when you want to collaborate with
non-technical stakeholders. It's been our replacement for the "spreadsheet"
that a lot of our clients used before Quepid -- where the non-technical
folks would list queries and mark which results looked good or bad.

Scripts work very well for getting that pass/fail response. It's nice that
Kelvin gives you a "temperature" instead of just a pass/fail; that level of
fuzziness is definitely useful.

We certainly see value in both (and will probably be doing more to
integrate Quepid with continuous integration/scripting).
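
Even before that, a script like Kelvin can gate a build on its own; a
minimal sketch of such a check, assuming the script leaves its error count
in a baseline file between runs:

    import json
    import sys

    def gate(current_errors, baseline_path="relevancy_baseline.json"):
        """Fail the build if relevancy got worse than the last accepted run."""
        try:
            with open(baseline_path) as f:
                baseline = json.load(f)["errors"]
        except FileNotFoundError:
            baseline = current_errors  # first run: accept and record it
        if current_errors > baseline:
            sys.exit("relevancy regressed: %d errors (baseline %d)"
                     % (current_errors, baseline))
        with open(baseline_path, "w") as f:
            json.dump({"errors": current_errors}, f)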

Cheers,
-Doug





-- 
Doug Turnbull
Search & Big Data Architect
OpenSource Connections <http://o19s.com>