Posted to solr-user@lucene.apache.org by Joe Obernberger <jo...@gmail.com> on 2017/08/22 16:32:36 UTC

Machine Learning for search

Hi All - One of the really neat features of Solr 6 is the ability to 
create machine learning models (information gain) and then use those 
models as a query.  If I want a user to be able to execute a query for 
the text "Hawaii" and use a machine learning model related to weather 
data, how can I correctly rank the results?  It looks like I would need 
to classify all the documents in some date range (assuming the query is 
date-restricted), look at probability_d, and pick the top n 
documents.  Is there a better way to do this?

I'm using a stream like this:
classify(model(models, id="WeatherModel", cacheMillis=5000),
         search(COL1,
                df="FULL_DOCUMENT",
                q="Hawaii AND DocTimestamp:[2017-07-23T04:00:00Z TO 2017-08-23T03:59:00Z]",
                fl="ClusterText,id",
                sort="id asc",
                rows="10000"),
         field="ClusterText")

This sends the query to all the shards, each of which can return at most 10,000 docs.
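
If picking the top n by probability_d is the way to go, I'm guessing I 
could wrap the whole thing in top() and sort on probability_d, something 
like this (untested, and n=100 is arbitrary):

top(n=100,
    classify(model(models, id="WeatherModel", cacheMillis=5000),
             search(COL1,
                    df="FULL_DOCUMENT",
                    q="Hawaii AND DocTimestamp:[2017-07-23T04:00:00Z TO 2017-08-23T03:59:00Z]",
                    fl="ClusterText,id",
                    sort="id asc",
                    rows="10000"),
             field="ClusterText"),
    sort="probability_d desc")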

Thanks!

-Joe

Fwd: Machine Learning for search

Posted by Joel Bernstein <jo...@gmail.com>.
I forgot to include the users list in my response below:
---------------

Interesting. I've been meaning to test the classifier in a similar way but
haven't had the time.

Basically what you did is create two classes:

1) A positive class
2) A very noisy negative class of "other stuff"

It was unclear from my reading on logistic regression whether this would
actually work. So I'm excited to hear that the classifier is indeed
providing good results with a noisy negative class, because this is a very
useful scenario.

One thing you may want to consider is taking some features from the model
and using them at query time. This would provide results that are better
candidates to fit the model and then you may not have to rerank such a
large set.
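
For example (just a sketch, the terms below are made up and would really 
come from the model's top weighted features), seeding the query with a 
few of those terms should shrink the candidate set before classify() runs:

classify(model(models, id="WeatherModel", cacheMillis=5000),
         search(COL1,
                df="FULL_DOCUMENT",
                q="(tornado OR hurricane OR blizzard OR flood) AND DocTimestamp:[2017-07-23T04:00:00Z TO 2017-08-23T03:59:00Z]",
                fl="ClusterText,id",
                sort="id asc",
                rows="1000"),
         field="ClusterText")

The model still does the final ordering, but over a much smaller set.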

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Aug 23, 2017 at 6:02 PM, Joe Obernberger <
joseph.obernberger@gmail.com> wrote:

> Thank you Joel.  I'm really having a good time with the machine learning
> component in Solr.  In this case, the weather model was built by
> classifying tweets as positive or negative.  I started by searching for
> tweets with terms like tornado, storm, forecast, typhoon, hurricane,
> blizzard, snow, lightning, flood warning, etc.. and making those positive.
> Then I grabbed some randoms tweets about Trump, ISIS, Kardashian, etc. to
> make negative tweets.  At that point I started to classify data and refine
> the model (adding more positive/negative) as more data came into the system.
>
> I hope that helps.  The model works very well at this point with just 650
> tweets manually classified (pos/neg about split even) and using 150 terms.
>
> I like your idea about using the model to re-rank the top n search
> results.  That said, the results can be significantly 'better' if I
> classify more data and reorder based on high probability scores; but as you
> pointed out at the cost of much slower searches.  In some cases, I would
> suspect a user may want to search just with a model and without any search
> terms, but in those cases it may be best to classify data as it comes in.
> I guess it's a toss up between what is more important - high probability
> from the classifier vs high rank from the search engine.
> Thanks Joel.
>
> -Joe
>
>
>
> On 8/23/2017 3:08 PM, Joel Bernstein wrote:
>
>> Can you describe the weather model?
>>
>> In general the idea is to rerank the top N docs, because it will be too
>> slow to classify the whole result set.
>>
>> In this scenario the search engine ranking will already be returning
>> relevant candidate documents and the model is only used to get a better
>> ordering of the top docs.
>>
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Tue, Aug 22, 2017 at 12:32 PM, Joe Obernberger <
>> joseph.obernberger@gmail.com> wrote:
>>
>> Hi All - One of the really neat features of solr 6 is the ability to
>>> create machine learning models (information gain) and then use those
>>> models
>>> as a query.  If I want a user to be able to execute a query for the text
>>> Hawaii and use a machine learning model related to weather data, how can
>>> I
>>> correctly rank the results?  It looks like I would need to classify all
>>> the
>>> documents in some date range (assuming the query is date restricted),
>>> look
>>> at the probability_d and pick the top n documents.  Is there a better way
>>> to do this?
>>>
>>> I'm using a stream like this:
>>> classify(model(models,id="WeatherModel",cacheMillis=5000),
>>> search(COL1,df="FULL_DOCUMENT",q="Hawaii AND
>>> DocTimestamp:[2017-07-23T04:00:00Z TO 2017-08-23T03:59:00Z]",
>>> fl="ClusterText,id",sort="id asc",rows="10000"),field="ClusterText")
>>>
>>> This sends this to all the shards who can return at most 10,000 docs
>>> each.
>>>
>>> Thanks!
>>>
>>> -Joe
>>>
>>>
>>>
>

Re: Machine Learning for search

Posted by Joe Obernberger <jo...@gmail.com>.
Thank you Joel.  I'm really having a good time with the machine learning 
component in Solr.  In this case, the weather model was built by 
classifying tweets as positive or negative.  I started by searching for 
tweets with terms like tornado, storm, forecast, typhoon, hurricane, 
blizzard, snow, lightning, flood warning, etc., and marking those 
positive.  Then I grabbed some random tweets about Trump, ISIS, 
Kardashian, etc. to use as negative examples.  At that point I started to 
classify data and refine the model (adding more positive/negative 
examples) as more data came into the system.

I hope that helps.  The model works very well at this point with just 
650 tweets manually classified (split about evenly between positive and 
negative) and using 150 terms.
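
The training side is basically just the standard features()/train() 
expressions; roughly along these lines (the collection and field names 
here are placeholders, not my real ones):

update(models, batchSize="50",
       train(tweet_training,
             features(tweet_training,
                      q="*:*",
                      featureSet="weatherFeatures",
                      field="ClusterText",
                      outcome="out_i",
                      numTerms=150),
             q="*:*",
             name="WeatherModel",
             field="ClusterText",
             outcome="out_i",
             maxIterations="100"))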

I like your idea of using the model to re-rank the top n search 
results.  That said, the results can be significantly 'better' if I 
classify more data and reorder based on high probability scores, but, as 
you pointed out, at the cost of much slower searches.  In some cases I 
suspect a user may want to search with just a model and no search terms 
at all; in those cases it may be best to classify data as it comes in.  
I guess it's a toss-up between what is more important: high probability 
from the classifier vs. high rank from the search engine.
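
For the classify-as-it-comes-in case, I was picturing a daemon that pulls 
new documents off a topic(), classifies them, and writes the results 
(including probability_d) to a side collection; roughly like this 
(untested, and the checkpoint/output collection names are made up):

daemon(id="weatherClassifyDaemon",
       runInterval="60000",
       update(classifiedTweets, batchSize="500",
              classify(model(models, id="WeatherModel", cacheMillis=5000),
                       topic(checkpoints, COL1,
                             q="*:*",
                             fl="id,ClusterText",
                             id="weatherTopic"),
                       field="ClusterText")))
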
Thanks Joel.

-Joe


On 8/23/2017 3:08 PM, Joel Bernstein wrote:
> Can you describe the weather model?
>
> In general the idea is to rerank the top N docs, because it will be too
> slow to classify the whole result set.
>
> In this scenario the search engine ranking will already be returning
> relevant candidate documents and the model is only used to get a better
> ordering of the top docs.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Aug 22, 2017 at 12:32 PM, Joe Obernberger <
> joseph.obernberger@gmail.com> wrote:
>
>> Hi All - One of the really neat features of solr 6 is the ability to
>> create machine learning models (information gain) and then use those models
>> as a query.  If I want a user to be able to execute a query for the text
>> Hawaii and use a machine learning model related to weather data, how can I
>> correctly rank the results?  It looks like I would need to classify all the
>> documents in some date range (assuming the query is date restricted), look
>> at the probability_d and pick the top n documents.  Is there a better way
>> to do this?
>>
>> I'm using a stream like this:
>> classify(model(models,id="WeatherModel",cacheMillis=5000),
>> search(COL1,df="FULL_DOCUMENT",q="Hawaii AND
>> DocTimestamp:[2017-07-23T04:00:00Z TO 2017-08-23T03:59:00Z]",fl="ClusterText,id",sort="id
>> asc",rows="10000"),field="ClusterText")
>>
>> This sends this to all the shards who can return at most 10,000 docs each.
>>
>> Thanks!
>>
>> -Joe
>>
>>
>


Re: Machine Learning for search

Posted by Joel Bernstein <jo...@gmail.com>.
Can you describe the weather model?

In general the idea is to rerank the top N docs, because it will be too
slow to classify the whole result set.

In this scenario the search engine ranking will already be returning
relevant candidate documents and the model is only used to get a better
ordering of the top docs.



Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Aug 22, 2017 at 12:32 PM, Joe Obernberger <
joseph.obernberger@gmail.com> wrote:

> Hi All - One of the really neat features of solr 6 is the ability to
> create machine learning models (information gain) and then use those models
> as a query.  If I want a user to be able to execute a query for the text
> Hawaii and use a machine learning model related to weather data, how can I
> correctly rank the results?  It looks like I would need to classify all the
> documents in some date range (assuming the query is date restricted), look
> at the probability_d and pick the top n documents.  Is there a better way
> to do this?
>
> I'm using a stream like this:
> classify(model(models,id="WeatherModel",cacheMillis=5000),
> search(COL1,df="FULL_DOCUMENT",q="Hawaii AND
> DocTimestamp:[2017-07-23T04:00:00Z TO 2017-08-23T03:59:00Z]",fl="ClusterText,id",sort="id
> asc",rows="10000"),field="ClusterText")
>
> This sends this to all the shards who can return at most 10,000 docs each.
>
> Thanks!
>
> -Joe
>
>