Posted to solr-user@lucene.apache.org by Nejla Karacan <ne...@tu-dortmund.de> on 2012/01/26 18:18:48 UTC

Solr and TF-IDF

Hey there,

I'm using Solr for my thesis, where I have to implement a content-based
recommender system for movies.

I have indexed about 20,000 movies with the following information:
movie-id
title
genre
plot/movie-description <- !!!
cast

I've enabled the TermVectorComponent for the fields genre, description and
cast,
so I can get the tf-idf values for the terms of every movie.

With these term/tf-idf pairs I have to compute the similarities
between movies using cosine similarity.
I know about Solr's MoreLikeThis (MLT) feature, but that's not an
option here: I have to
implement the cosine similarity in Java myself.
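
A minimal sketch of such a cosine similarity over two term-to-tf-idf maps, assuming each movie's vector is held as a Map<String, Double> (the class and method names are hypothetical, not part of Solr):

```java
import java.util.Map;

// Hypothetical helper, not a Solr class: cosine similarity of sparse vectors.
final class CosineSimilarity {
    private CosineSimilarity() {}

    /** Returns dot(a, b) / (|a| * |b|), or 0.0 if either vector is empty or all zeros. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        // Only terms present in both maps contribute to the dot product,
        // so iterate over the smaller map and look up in the larger one.
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }
}
```

Two movies with proportional term weights get a similarity close to 1.0, and two movies with no terms in common get 0.0.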

Now I have some problems/questions:
I get the responses in XML format, which I read with an XML reader in
Java
that walks through every child node in order to reach the right one.
Is there a better way to get at these values in the node attributes or node texts?
I have tried wt=csv, but then the responses
contain only the movie IDs, nothing more.
With the XML response writer my request is, for example, this:
http://localhost:8983/solr/select/?qt=tvrh&q=id:1800180382&fl=id&tv.tf_idf=true
I get the right response with all terms and their tf-idf values, in XML.

And if I add wt=csv
http://localhost:8983/solr/select/?qt=tvrh&q=id:1800180382&fl=id&tv.tf_idf=true&wt=csv
I get only this:
id
1800180382

Maybe my request is wrong?
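
On reading the XML: one way to avoid walking every child node by hand is an XPath query with the JDK's javax.xml.xpath package. The following is only a sketch. It assumes the response nests the values as lst[@name='termVectors'] / lst (one per document) / lst[@name=FIELD] / lst[@name=TERM] / double[@name='tf-idf'], which should be checked against the actual output. The class name TermVectorXml and its method are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: pulls term -> tf-idf pairs for one field out of the XML.
final class TermVectorXml {
    private TermVectorXml() {}

    /**
     * Assumes the response contains a single document (e.g. q=id:...);
     * with several documents the terms of all of them would be merged.
     * The field name is assumed to contain no quote characters.
     */
    static Map<String, Double> tfIdf(String xml, String field) {
        // Assumed layout: termVectors / <doc> / <field> / <term> / tf-idf
        String expr = "//lst[@name='termVectors']/lst/lst[@name='" + field
                + "']/lst/double[@name='tf-idf']";
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes;
        try {
            nodes = (NodeList) xpath.evaluate(
                    expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on response", e);
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element value = (Element) nodes.item(i);
            // The parent <lst name="TERM"> carries the term text in its name attribute.
            Element term = (Element) value.getParentNode();
            result.put(term.getAttribute("name"),
                    Double.parseDouble(value.getTextContent().trim()));
        }
        return result;
    }
}
```

Called as TermVectorXml.tfIdf(responseBody, "genre"), this returns the terms of the genre field with their tf-idf values in document order.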

Another problem: when I get the terms and their tf-idf values, I store
them in a map,
but the values come in no particular order. I want to store only e.g. the 10
top terms,
i.e. the 10 terms with the highest tf-idf values. Can I get them sorted in descending
order?
I haven't found anything for that. If it's not possible, I'll have to sort them
myself afterwards in the map.
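
Sorting on the client side is short in Java. A minimal sketch, assuming the terms are already in a Map<String, Double> (the class name TopTerms is hypothetical):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: keeps the n entries with the highest values.
final class TopTerms {
    private TopTerms() {}

    /** Returns at most n entries, ordered by descending value. */
    static Map<String, Double> topN(Map<String, Double> weights, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(weights.entrySet());
        // Highest tf-idf first.
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        int limit = Math.max(0, Math.min(n, entries.size()));
        // LinkedHashMap preserves the sorted insertion order.
        Map<String, Double> top = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : entries.subList(0, limit)) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }
}
```

TopTerms.topN(terms, 10) then gives the 10 highest-weighted terms; iterating over the result visits them from highest to lowest value.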

My last question:
every movie has a genre, often more than one.
It's like the "cat" (category) field in the exampledocs with ipod/monitor
etc., and it's an important factor for the movies.
How can I take this factor into account?
I changed the boost attribute in the Solr XML schema like this:
<field name="genre" type="string" indexed="true" stored="true"
multiValued="true" omitNorms="false" boost="3" termVectors="true"
termPositions="true" termOffsets="true"/>
Is that enough, or is there another way?
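
One other possibility, independent of any schema setting, is to weight the genre terms on the client side before computing the similarity. The sketch below assumes the per-field term vectors are available as maps and merges them into one vector, multiplying each field's values by a weight. The class name FieldWeighting, the "field:term" key format and the weight of 3 for genre are all assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: merges per-field vectors into one weighted vector.
final class FieldWeighting {
    private FieldWeighting() {}

    /**
     * Keys of the result are "field:term", so the same word in two fields
     * stays separate. Fields without an explicit weight are weighted 1.0.
     */
    static Map<String, Double> combine(Map<String, Map<String, Double>> vectorsByField,
                                       Map<String, Double> fieldWeights) {
        Map<String, Double> combined = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> field : vectorsByField.entrySet()) {
            double weight = fieldWeights.getOrDefault(field.getKey(), 1.0);
            for (Map.Entry<String, Double> term : field.getValue().entrySet()) {
                combined.put(field.getKey() + ":" + term.getKey(), weight * term.getValue());
            }
        }
        return combined;
    }
}
```

With a weight of 3.0 for genre, a shared genre term contributes nine times as much to the dot product of two combined vectors as it would unweighted, so genre overlap dominates the cosine similarity more strongly.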

As you can probably see, I am a beginner with Solr.
A few weeks ago, when I started, it was even more difficult for me, but now
it is going better.
I would be very grateful for any help, ideas, tips or suggestions!

Many regards
Nejla

Re: Solr and TF-IDF

Posted by Lee Carroll <le...@googlemail.com>.

"content-based recommender", so it's not CF (collaborative filtering) etc.,
and it's a project, so it's whatever the supervisor wants.

Take a look at SolrJ; it should be more natural to integrate with your Java code.

(Although I'm not sure if it supports the TermVectorComponent.)

good luck



On 26 January 2012 17:27, Walter Underwood <wu...@wunderwood.org> wrote:
> Why are you using a search engine to build a recommender? None of the leading teams in the Netflix Prize used search engines as a base technology.
>
> Start with the recommender algorithms in Mahout: http://mahout.apache.org/
>
> wunder

Re: Solr and TF-IDF

Posted by Walter Underwood <wu...@wunderwood.org>.

Why are you using a search engine to build a recommender? None of the leading teams in the Netflix Prize used search engines as a base technology.

Start with the recommender algorithms in Mahout: http://mahout.apache.org/

wunder
