Posted to user@mahout.apache.org by Arian Pasquali <ar...@arianpasquali.com> on 2014/09/24 03:53:03 UTC

word weights using BM25

Hi,
I was wondering whether it would be possible to support BM25 term weighting
by extending Mahout's tf-idf implementation.

I was curious to know if anyone here has already tried to do so.
If not, how would you suggest implementing it in Mahout?


Arian Pasquali
http://about.me/arianpasquali

Re: word weights using BM25

Posted by Ted Dunning <te...@gmail.com>.
Marko,

Sorry to be non-responsive.

There is no good user manual for the streaming k-means software, and
there are some known scaling pathologies with that code.

I know something about it myself, but I currently lack the time to provide
detailed support.

Can you remind me what your interest is?  Is this research?  Or are you
looking for something more industrial?




Re: word weights using BM25

Posted by Ted Dunning <te...@gmail.com>.
Marko,

Suneel's answer is much better than mine.


Re: word weights using BM25

Posted by Suneel Marthi <su...@gmail.com>.
@Marko, Subject: Streaming KMeans

See
http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means/18090471#18090471
for how to invoke Streaming KMeans.

Also look at examples/bin/cluster-reuters.sh for the Streaming KMeans
option.



Re: word weights using BM25

Posted by Marko <ma...@nissatech.com>.
Hello everyone,

I'm very sorry to barge in like this. I have been added to the mailing list
(I think), but it seems that I'm somehow unable to ask a question; that
is, I asked a question a few times and got no answer. I hope this way
will work.

I'm new to Mahout and I've been struggling with Streaming K-means for a
while now. Is there any tutorial or example of how to use it, how to get
results, and how to call the clustering function?

Any help would be great,
Thanks



Re: word weights using BM25

Posted by Pat Ferrel <pa...@occamsmachete.com>.
We are moving to higher-performance platforms than Hadoop MapReduce, such as Spark. You can still write map/reduce-style code, but Mahout is not accepting new Hadoop MapReduce code.



Re: word weights using BM25

Posted by Arian Pasquali <ar...@arianpasquali.com>.
Yes Suneel,
Indeed, it is written in MapReduce fashion.

What exactly do you mean when you say Mahout is not accepting any new
MapReduce code?
Do you mean for submitting a patch?
I'm sure there are better ways to implement it, but I'm more
interested in the results right now.

What would be your suggestion?

best





Arian Pasquali
http://about.me/arianpasquali


Re: word weights using BM25

Posted by Suneel Marthi <sm...@apache.org>.
How did you implement BM25PartialVectorReducer and BM25Converter? The
present implementations of TFIDFConverter and the reducer are MapReduce.
Mahout is not accepting any new MapReduce code.


Re: word weights using BM25

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Oct 1, 2014 at 7:52 AM, Arian Pasquali <ar...@arianpasquali.com>
wrote:

> My dataset is a collection of documents in German, and I can say that the
> scores seem better than my TF-IDF scores. The results make more sense
> now, especially for my bi-grams.
>

OK.

I will take note.

Re: word weights using BM25

Posted by Arian Pasquali <ar...@arianpasquali.com>.
Hi Ted,

My dataset is a collection of documents in German, and I can say that the
scores seem better than my TF-IDF scores. The results make more sense
now, especially for my bi-grams.




Arian Pasquali
http://about.me/arianpasquali


Re: word weights using BM25

Posted by Ted Dunning <te...@gmail.com>.
Thanks so much for the feedback.  Glad to hear it was straightforward.


But the important question is ....

how did BM25 work for you?




Re: word weights using BM25

Posted by Arian Pasquali <ar...@arianpasquali.com>.
Hey guys,
I think it is fair to give you some feedback.
I managed to implement BM25+ <http://en.wikipedia.org/wiki/Okapi_BM25> term
scoring on Mahout.
It was straightforward using the current TFIDF implementation as an example.

Basically, what I did was implement the interface
org.apache.mahout.vectorizer.Weight and create a BM25Converter and a
BM25PartialVectorReducer, similar to TFIDFConverter
<https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/tfidf/TFIDFConverter.html>
and
TFIDFPartialVectorReducer
<https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/tfidf/TFIDFPartialVectorReducer.html>
respectively.

cheers
Arian

Arian Pasquali
http://about.me/arianpasquali
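
For readers who want to see what such a term weight computes, here is a
minimal sketch of the Okapi BM25 formula in Python. It is not the code
described in the message above, which does not appear in this thread, and it
does not use any Mahout or Lucene API. The function name bm25_weight, its
parameter names, the defaults k1=1.2 and b=0.75, and the particular smoothed
IDF are all assumptions chosen for illustration. Setting delta above zero
adds a constant to the term-frequency part, which is the idea behind the
BM25+ variant.

```python
import math


def bm25_weight(tf, df, doc_len, avg_doc_len, num_docs,
                k1=1.2, b=0.75, delta=0.0):
    """Illustrative BM25 weight of one term in one document.

    tf          -- occurrences of the term in the document
    df          -- number of documents containing the term
    doc_len     -- length of this document in tokens
    avg_doc_len -- average document length over the corpus
    num_docs    -- number of documents in the corpus
    delta       -- 0.0 for plain BM25, > 0 for a BM25+ style lower bound
    """
    if tf <= 0 or df <= 0:
        return 0.0
    # Smoothed, non-negative IDF (an assumed choice; other variants exist).
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalisation: longer-than-average documents are penalised.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturating term-frequency component, bounded above by k1 + 1.
    tf_part = (tf * (k1 + 1.0)) / (tf + k1 * norm)
    return idf * (tf_part + delta)
```

Unlike a tf-idf weight, this needs the average document length, so a
MapReduce or Spark port would have to compute that corpus statistic in an
extra pass or carry it alongside the document frequencies.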


Re: word weights using BM25

Posted by Arian Pasquali <ar...@arianpasquali.com>.
Yes,
I'm studying his work <http://nlp.uned.es/~jperezi/Lucene-BM25/> and
Mahout's current tf-idf code.
I'm trying to understand how I would port that to MapReduce.
I'll try to share something if I succeed.





Arian Pasquali
http://about.me/arianpasquali


Re: word weights using BM25

Posted by Suneel Marthi <su...@gmail.com>.
Lucene 4.x supports Okapi BM25, so it should be easy to implement.


Re: word weights using BM25

Posted by Ted Dunning <te...@gmail.com>.
Should be pretty easy. I haven't heard of anyone doing it.  

Sent from my iPhone
